[jira] [Resolved] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30896.
--
Resolution: Later

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056672#comment-17056672
 ] 

Hyukjin Kwon commented on SPARK-30896:
--

Yeah, let's not fix it for now.

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>







[jira] [Commented] (SPARK-31113) Support DDL "SHOW VIEWS"

2020-03-10 Thread Xin Wu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056669#comment-17056669
 ] 

Xin Wu commented on SPARK-31113:


Sure, I'm working on this! Thanks [~smilegator]

> Support DDL "SHOW VIEWS"
> 
>
> Key: SPARK-31113
> URL: https://issues.apache.org/jira/browse/SPARK-31113
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> It would be nice to have a `SHOW VIEWS` command similar to Hive 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
>  
>  






[jira] [Resolved] (SPARK-31070) make skew join split skewed partitions more evenly

2020-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-31070.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> make skew join split skewed partitions more evenly
> --
>
> Key: SPARK-31070
> URL: https://issues.apache.org/jira/browse/SPARK-31070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056636#comment-17056636
 ] 

Wenchen Fan commented on SPARK-30896:
-

More importantly, what should be the official way when we add new configs that 
can affect expression behavior? Shall we just store the config value in a `val` 
or put it in the expression constructor? Also cc [~Gengliang.Wang].
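
For illustration, a minimal sketch of the constructor option (simplified, assumed 
names, not Spark's actual Expression API): the config value is read once where the 
expression is built and passed in explicitly, so the expression itself never touches 
SQLConf.get.

{code:scala}
// Hypothetical, simplified sketch (not Spark's real classes).
// The builder reads the config once and passes it in, so the expression's
// behavior is fixed for its whole lifetime.
final case class CsvToStructsLike(input: String, failFast: Boolean) {
  def eval(): Option[String] =
    if (input.contains(",")) Some(input)              // "parsed" successfully
    else if (failFast) throw new IllegalArgumentException(s"bad CSV: $input")
    else None                                         // permissive mode: null result
}

object ConstructorOptionDemo extends App {
  // stand-in for reading SQLConf.get at expression-construction time
  val failFastFromConf = sys.props.getOrElse("demo.failFast", "true").toBoolean
  val expr = CsvToStructsLike("not-csv", failFast = failFastFromConf)
  println(scala.util.Try(expr.eval()))                // behavior depends only on the captured argument
}
{code}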

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>







[jira] [Commented] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056634#comment-17056634
 ] 

Wenchen Fan commented on SPARK-30895:
-

See https://issues.apache.org/jira/browse/SPARK-30896

> The behavior of CsvToStructs should not depend on SQLConf.get
> -
>
> Key: SPARK-30895
> URL: https://issues.apache.org/jira/browse/SPARK-30895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>







[jira] [Updated] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30895:

Priority: Minor  (was: Major)

> The behavior of CsvToStructs should not depend on SQLConf.get
> -
>
> Key: SPARK-30895
> URL: https://issues.apache.org/jira/browse/SPARK-30895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>







[jira] [Commented] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056633#comment-17056633
 ] 

Wenchen Fan commented on SPARK-31099:
-

Is this only a problem for a local Hive metastore setup?

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
>   at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
>   at 
> 

[jira] [Reopened] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-30895:
-

> The behavior of CsvToStructs should not depend on SQLConf.get
> -
>
> Key: SPARK-30895
> URL: https://issues.apache.org/jira/browse/SPARK-30895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056632#comment-17056632
 ] 

Wenchen Fan commented on SPARK-30896:
-

`JsonToStructs` already stores the config value in a `val`, so the behavior 
won't change after the expression is created. There are some corner cases where 
we transform the expression tree under a different config, but it's not a 
critical bug. I've updated the priority to minor.

[~viirya] [~maropu] [~hyukjin.kwon] do you think it's worth fixing?
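
For reference, a minimal sketch of the val-capture pattern described above 
(simplified, assumed names, not the actual JsonToStructs code): the flag is read 
from a global config once, at construction, so flipping the config afterwards does 
not change the expression's behavior.

{code:scala}
// Hypothetical, simplified sketch (not Spark's real classes).
object GlobalConf { @volatile var jsonFailFast: Boolean = true }   // stand-in for SQLConf.get

class JsonToStructsLike(input: String) {
  private val failFast: Boolean = GlobalConf.jsonFailFast          // captured once, at construction

  def eval(): Option[String] =
    if (input.trim.startsWith("{")) Some(input)                    // "parsed" successfully
    else if (failFast) throw new IllegalArgumentException(s"bad JSON: $input")
    else None                                                      // permissive mode: null result
}

object ValCaptureDemo extends App {
  val expr = new JsonToStructsLike("not json")
  GlobalConf.jsonFailFast = false                 // config flips after the expression exists
  println(scala.util.Try(expr.eval()))            // still a Failure: the captured value wins
}
{code}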

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>







[jira] [Reopened] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-30896:
-

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30896:

Priority: Minor  (was: Major)

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Minor
>







[jira] [Updated] (SPARK-30893) Expressions should not change its data type/nullability after it's created

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30893:

Summary: Expressions should not change its data type/nullability after it's 
created  (was: Expressions should not change its data type/behavior after it's 
created)

> Expressions should not change its data type/nullability after it's created
> --
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan which can lead 
> to crashes or data corruption, if data type/nullability gets changed.
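
A minimal sketch of the hazard described above (assumed, simplified names, not 
Spark's actual planner): if nullability re-reads a mutable config, two planning 
phases can see different answers for the same expression.

{code:scala}
// Hypothetical, simplified sketch (not Spark's real classes).
object PlannerConf { @volatile var ansiEnabled: Boolean = false }   // stand-in for SQLConf.get

class DivideLike {
  // Problematic: re-reads the global config on every call instead of fixing it at construction.
  def nullable: Boolean = !PlannerConf.ansiEnabled
}

object NullabilityDriftDemo extends App {
  val expr = new DivideLike
  val seenAtAnalysis = expr.nullable     // true, because the config is false here
  PlannerConf.ansiEnabled = true         // config changes between planning phases
  val seenAtCodegen = expr.nullable      // false, so the two phases now disagree
  println(s"analysis=$seenAtAnalysis codegen=$seenAtCodegen")
}
{code}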






[jira] [Updated] (SPARK-31113) Support DDL "SHOW VIEWS"

2020-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31113:

Issue Type: New Feature  (was: Bug)

> Support DDL "SHOW VIEWS"
> 
>
> Key: SPARK-31113
> URL: https://issues.apache.org/jira/browse/SPARK-31113
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> It would be nice to have a `SHOW VIEWS` command similar to Hive 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
>  
>  






[jira] [Comment Edited] (SPARK-31113) Support DDL "SHOW VIEWS"

2020-03-10 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056628#comment-17056628
 ] 

Xiao Li edited comment on SPARK-31113 at 3/11/20, 4:04 AM:
---

cc [~EricWu] Could you try this?


was (Author: smilegator):
cc [~EricWu] 

> Support DDL "SHOW VIEWS"
> 
>
> Key: SPARK-31113
> URL: https://issues.apache.org/jira/browse/SPARK-31113
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> It would be nice to have a `SHOW VIEWS` command similar to Hive 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
>  
>  






[jira] [Commented] (SPARK-31113) Support DDL "SHOW VIEWS"

2020-03-10 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056628#comment-17056628
 ] 

Xiao Li commented on SPARK-31113:
-

cc [~EricWu] 

> Support DDL "SHOW VIEWS"
> 
>
> Key: SPARK-31113
> URL: https://issues.apache.org/jira/browse/SPARK-31113
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> It would be nice to have a `SHOW VIEWS` command similar to Hive 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
>  
>  






[jira] [Created] (SPARK-31113) Support DDL "SHOW VIEWS"

2020-03-10 Thread Xiao Li (Jira)
Xiao Li created SPARK-31113:
---

 Summary: Support DDL "SHOW VIEWS"
 Key: SPARK-31113
 URL: https://issues.apache.org/jira/browse/SPARK-31113
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


It would be nice to have a `SHOW VIEWS` command similar to Hive 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
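
As a rough sketch of how the requested command might be exercised from Spark once 
implemented (illustrative only; SHOW VIEWS is not supported at the time of filing, 
and the exact syntax would follow the Hive page linked above):

{code:scala}
// Illustrative only: assumes a hypothetical SHOW VIEWS implementation that
// mirrors Hive's syntax; it does not run on Spark 3.0.0.
import org.apache.spark.sql.SparkSession

object ShowViewsSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("show-views-sketch").getOrCreate()
  spark.sql("CREATE OR REPLACE TEMPORARY VIEW v1 AS SELECT 1 AS id")
  spark.sql("SHOW VIEWS").show()             // would list views in the current database, including v1
  spark.sql("SHOW VIEWS LIKE 'v*'").show()   // Hive also supports an optional LIKE pattern
  spark.stop()
}
{code}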
 

 






[jira] [Created] (SPARK-31112) Use multiple external catalogs to speed up metastore access

2020-03-10 Thread deshanxiao (Jira)
deshanxiao created SPARK-31112:
--

 Summary: Use multiple external catalogs to speed up metastore access
 Key: SPARK-31112
 URL: https://issues.apache.org/jira/browse/SPARK-31112
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: deshanxiao









[jira] [Resolved] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30896.
-
Resolution: Won't Fix

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get

2020-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30895.
-
Resolution: Won't Fix

> The behavior of CsvToStructs should not depend on SQLConf.get
> -
>
> Key: SPARK-30895
> URL: https://issues.apache.org/jira/browse/SPARK-30895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-30893) Expressions should not change its data type/behavior after it's created

2020-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30893.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Expressions should not change its data type/behavior after it's created
> ---
>
> Key: SPARK-30893
> URL: https://issues.apache.org/jira/browse/SPARK-30893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Critical
> Fix For: 3.0.0
>
>
> This is a problem because the configuration can change between different 
> phases of planning, and this can silently break a query plan which can lead 
> to crashes or data corruption, if data type/nullability gets changed.






[jira] [Commented] (SPARK-31104) Add documentation for all the Json Functions

2020-03-10 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056584#comment-17056584
 ] 

Rakesh Raushan commented on SPARK-31104:


I am working on it.

> Add documentation for all the Json Functions
> 
>
> Key: SPARK-31104
> URL: https://issues.apache.org/jira/browse/SPARK-31104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>







[jira] [Commented] (SPARK-31095) Upgrade netty-all to 4.1.47.Final

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056561#comment-17056561
 ] 

Dongjoon Hyun commented on SPARK-31095:
---

For `branch-2.4`, https://github.com/apache/spark/pull/27870 has been created.

> Upgrade netty-all to 4.1.47.Final
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Vishwas Vijaya Kumar
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: security
> Fix For: 3.0.0
>
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]
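
For a downstream build, a hedged sketch of what pinning the upgraded artifact could 
look like in sbt (illustrative only; Spark itself manages the netty version in its 
Maven poms, and the coordinates below are the standard io.netty ones):

{code:scala}
// build.sbt fragment (assumption: a downstream sbt project that wants to force
// the patched netty-all version onto its classpath).
dependencyOverrides += "io.netty" % "netty-all" % "4.1.47.Final"
{code}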






[jira] [Resolved] (SPARK-31095) Upgrade netty-all to 4.1.47.Final

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31095.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27869
[https://github.com/apache/spark/pull/27869]

> Upgrade netty-all to 4.1.47.Final
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Vishwas Vijaya Kumar
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: security
> Fix For: 3.0.0
>
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Assigned] (SPARK-31095) Upgrade netty-all to 4.1.47.Final

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31095:
-

Assignee: Dongjoon Hyun

> Upgrade netty-all to 4.1.47.Final
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Vishwas Vijaya Kumar
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: security
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Updated] (SPARK-31095) Upgrade netty-all to 4.1.47.Final

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31095:
--
Summary: Upgrade netty-all to 4.1.47.Final  (was: Upgrade netty version to 
fix security vulnerabilities)

> Upgrade netty-all to 4.1.47.Final
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Vishwas Vijaya Kumar
>Priority: Major
>  Labels: security
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31095:
--
Affects Version/s: (was: 2.4.4)
   3.1.0
   3.0.0

> Upgrade netty version to fix security vulnerabilities
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Vishwas Vijaya Kumar
>Priority: Major
>  Labels: security
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Commented] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Kris Mok (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056549#comment-17056549
 ] 

Kris Mok commented on SPARK-31099:
--

Just documenting the fact that users may encounter migration issues when 
upgrading from earlier versions of Spark to Spark 3.0 due to the Hive profile 
upgrade sounds good to me.

Derby migration is unlikely to be a production issue, and other databases 
(MySQL, PostgreSQL, etc.) are heavy enough that folks would probably realize it's a 
Hive metastore migration issue, just like what would happen in Hive.

But the documentation should at the very least describe:
* upgraded Hive profile
* what kind of error messages could occur
* links to Hive documentation of how to perform the upgrade

WDYT?

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at 

[jira] [Resolved] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-03-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30962.
--
Fix Version/s: 3.0.0
 Assignee: kevin yu
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27779]

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: kevin yu
>Priority: Major
> Fix For: 3.0.0
>
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]






[jira] [Commented] (SPARK-31102) spark-sql fails to parse when contains comment

2020-03-10 Thread Javier Fuentes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056466#comment-17056466
 ] 

Javier Fuentes commented on SPARK-31102:


Hey [~yumwang] I am checking this. Thanks!

> spark-sql fails to parse when contains comment
> --
>
> Key: SPARK-31102
> URL: https://issues.apache.org/jira/browse/SPARK-31102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> select
>   1,
>   -- two
>   2;
> {code}
> {noformat}
> spark-sql> select
>  >   1,
>  >   -- two
>  >   2;
> Error in query:
> mismatched input '' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', 
> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
> DATABASES, 'DAY', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
> 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 
> 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 
> 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 
> 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 
> 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IF', 'IGNORE', 'IMPORT', 
> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
> 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', 
> 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 
> 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 
> 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 
> 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 
> 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 
> 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 
> 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 
> 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SELECT', 'SEMI', 'SEPARATED', 
> 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 
> 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 
> 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 
> 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 
> 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 
> 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 
> 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 
> 'VALUES', 'VIEW', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', '+', '-', '*', 
> 'DIV', '~', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, 
> INTEGER_VALUE, EXPONENT_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, 
> BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 2)
> == SQL ==
> select
>   1,
> --^^^
> {noformat}






[jira] [Commented] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056465#comment-17056465
 ] 

Jungtaek Lim commented on SPARK-31099:
--

[~dongjoon]

Could you elaborate on your comment "Apache Spark 3.0 also doesn't support 
restarting from the old streaming checkpoint."? Spark 3.0 should certainly 
support the old checkpoint, except for some cases where we have to discard the 
old checkpoint to fix correctness issues.

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
>   at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144)
>   

[jira] [Commented] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056461#comment-17056461
 ] 

Gengliang Wang commented on SPARK-31099:


[~dongjoon] Makes sense. Let me close this one. Thank you.

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
>   at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
>   at 
> 

[jira] [Resolved] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31099.

Resolution: Won't Fix

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
>   at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303)
>   at 
> 

[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31095:
--
  Component/s: (was: Security)
   Build
Fix Version/s: (was: 2.4.5)
   (was: 2.4.4)

> Upgrade netty version to fix security vulnerabilities
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Vishwas Vijaya Kumar
>Priority: Major
>  Labels: security
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31095:
--
Priority: Major  (was: Critical)

> Upgrade netty version to fix security vulnerabilities
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Vishwas Vijaya Kumar
>Priority: Major
>  Labels: security
> Fix For: 2.4.4, 2.4.5
>
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31095:
--
Issue Type: Bug  (was: Improvement)

> Upgrade netty version to fix security vulnerabilities
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Vishwas Vijaya Kumar
>Priority: Major
>  Labels: security
> Fix For: 2.4.4, 2.4.5
>
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Commented] (SPARK-31095) Upgrade netty version to fix security vulnerabilities

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056444#comment-17056444
 ] 

Dongjoon Hyun commented on SPARK-31095:
---

Hi, [~visvijay]. You should not set `Fix Version`. Please see the contribution 
guide.
- https://spark.apache.org/contributing.html

> Upgrade netty version to fix security vulnerabilities
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Vishwas Vijaya Kumar
>Priority: Critical
>  Labels: security
> Fix For: 2.4.4, 2.4.5
>
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]






[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056439#comment-17056439
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

Thank you, [~Gengliang.Wang].

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Commented] (SPARK-31099) Create migration script for metastore_db

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056436#comment-17056436
 ] 

Dongjoon Hyun commented on SPARK-31099:
---

Hi, [~Gengliang.Wang], [~smilegator] and [~cloud_fan].
This doesn't sound like something Apache Spark provides officially.
If needed, users can use the official scripts from the `Apache Hive` project.
In addition, Apache Spark 3.0 also doesn't support restarting from old 
streaming checkpoints.

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by Hive 
> 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
>   at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
>   at 
> 

[jira] [Updated] (SPARK-31102) spark-sql fails to parse when contains comment

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31102:
--
Target Version/s: 3.0.0

> spark-sql fails to parse when contains comment
> --
>
> Key: SPARK-31102
> URL: https://issues.apache.org/jira/browse/SPARK-31102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> select
>   1,
>   -- two
>   2;
> {code}
> {noformat}
> spark-sql> select
>  >   1,
>  >   -- two
>  >   2;
> Error in query:
> mismatched input '' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', 
> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
> DATABASES, 'DAY', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
> 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 
> 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 
> 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 
> 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 
> 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IF', 'IGNORE', 'IMPORT', 
> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
> 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', 
> 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 
> 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 
> 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 
> 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 
> 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 
> 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 
> 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 
> 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SELECT', 'SEMI', 'SEPARATED', 
> 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 
> 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 
> 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 
> 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 
> 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 
> 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 
> 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 
> 'VALUES', 'VIEW', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', '+', '-', '*', 
> 'DIV', '~', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, 
> INTEGER_VALUE, EXPONENT_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, 
> BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 2)
> == SQL ==
> select
>   1,
> --^^^
> {noformat}
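> A minimal sketch suggesting the SQL parser itself accepts the trailing 
> comment, so the failure appears to be in how the spark-sql CLI splits its 
> input (assumption: run from spark-shell; the CLI statement splitter, not the 
> parser, is the suspect):
> {code:scala}
> // the same statement submitted through the programmatic API
> val df = spark.sql(
>   """select
>     |  1,
>     |  -- two
>     |  2""".stripMargin)
> df.show()
> {code}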



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30667) Support simple all gather in barrier task context

2020-03-10 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-30667.
--
Resolution: Done

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Sarth Frey
>Priority: Major
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide an all gather operation that lets tasks share additional 
> information with others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}
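> A minimal Scala sketch of the same pattern on the RDD barrier API 
> (assumption: the string-based allGather signature added by this ticket; the 
> port value is illustrative rather than a real free port):
> {code:scala}
> import org.apache.spark.BarrierTaskContext
>
> // in spark-shell: four barrier tasks exchange their "ports"
> sc.parallelize(1 to 4, 4).barrier().mapPartitions { iter =>
>   val ctx = BarrierTaskContext.get()
>   val port = 9000 + ctx.partitionId()        // pretend this is a free port
>   val ports = ctx.allGather(port.toString)   // all ports, ordered by task ID
>   // ... set up the distributed training service using `ports`
>   iter
> }.collect()
> {code}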



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28594:
--
Summary: Allow event logs for running streaming apps to be rolled over  
(was: Allow event logs for running streaming apps to be rolled over.)

> Allow event logs for running streaming apps to be rolled over
> -
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stephen Levett
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications, the event logs grow massively.  The files continue 
> to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress 
> files, but not the event logs of applications that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  
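> A minimal sketch of the knobs this umbrella adds in 3.0 (assumption: the 
> rolling event log config names introduced by its sub-tasks; values are 
> illustrative):
> {code:scala}
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .config("spark.eventLog.enabled", "true")
>   .config("spark.eventLog.dir", "hdfs:///spark-history")
>   // roll over to a new event log file once the current one reaches the cap
>   .config("spark.eventLog.rolling.enabled", "true")
>   .config("spark.eventLog.rolling.maxFileSize", "128m")
>   .getOrCreate()
> {code}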



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22783:
--
Parent: SPARK-28594
Issue Type: Sub-task  (was: Bug)

> event log directory(spark-history) filled by large .inprogress files for 
> spark streaming applications
> -
>
> Key: SPARK-22783
> URL: https://issues.apache.org/jira/browse/SPARK-22783
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
> Environment: Linux(Generic)
>Reporter: omkar kankalapati
>Priority: Major
>
> When running long-running streaming applications, HDFS storage gets filled 
> up with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>  hadoop fs -du -h /spark-history
> 234 /spark-history/.inprogress
> 46.6 G  /spark-history/.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, 
> Spark should rotate the current log file when it reaches a given size (for 
> example, 100 MB) or interval, and perhaps expose a configuration parameter 
> for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> It is very important and useful to support rotating the log files, because 
> users may have a limited HDFS quota and these large files consume it.
> Also, users do not have a viable workaround:
> 1) They cannot move the files to another location, because moving the file 
> causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating it 
> fails, because the file is still held open for writing by 
> EventLoggingListener:
> hdfs dfs -truncate -w 0 /spark-history/.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress 
> for DFSClient_NONMAPREDUCE_<#ID>on  because this file lease is currently 
> owned by DFSClient_NONMAPREDUCE_<#ID> on 
> The only workaround available is to disable the event logging for streaming 
> applications by setting "spark.eventLog.enabled" to false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-22783.
-

> event log directory(spark-history) filled by large .inprogress files for 
> spark streaming applications
> -
>
> Key: SPARK-22783
> URL: https://issues.apache.org/jira/browse/SPARK-22783
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
> Environment: Linux(Generic)
>Reporter: omkar kankalapati
>Priority: Major
>
> When running long-running streaming applications, HDFS storage gets filled 
> up with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>  hadoop fs -du -h /spark-history
> 234 /spark-history/.inprogress
> 46.6 G  /spark-history/.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, 
> Spark should rotate the current log file when it reaches a given size (for 
> example, 100 MB) or interval, and perhaps expose a configuration parameter 
> for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> It is very important and useful to support rotating the log files, because 
> users may have a limited HDFS quota and these large files consume it.
> Also, users do not have a viable workaround:
> 1) They cannot move the files to another location, because moving the file 
> causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating it 
> fails, because the file is still held open for writing by 
> EventLoggingListener:
> hdfs dfs -truncate -w 0 /spark-history/.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress 
> for DFSClient_NONMAPREDUCE_<#ID>on  because this file lease is currently 
> owned by DFSClient_NONMAPREDUCE_<#ID> on 
> The only workaround available is to disable the event logging for streaming 
> applications by setting "spark.eventLog.enabled" to false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28594:
--
Environment: (was: This has been reported on 2.0.2.22 but affects all 
currently available versions.)

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stephen Levett
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications, the event logs grow massively.  The files continue 
> to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress 
> files, but not the event logs of applications that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056417#comment-17056417
 ] 

Dongjoon Hyun commented on SPARK-28594:
---

I assigned this umbrella to [~kabhwan] since he has led this work actively.

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications, the event logs grow massively.  The files continue 
> to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress 
> files, but not the event logs of applications that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28594:
-

Assignee: Jungtaek Lim

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications, the event logs grow massively.  The files continue 
> to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress 
> files, but not the event logs of applications that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-29581) Enable cleanup old event log files

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-29581.
-

> Enable cleanup old event log files 
> ---
>
> Key: SPARK-29581
> URL: https://issues.apache.org/jira/browse/SPARK-29581
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue can be started only once SPARK-29579 is addressed properly.
> After SPARK-29579, Spark would guarantee strong compatibility for both live 
> entities and snapshots, which means a snapshot file could replace the older 
> original event log files. This issue tracks the effort to automatically 
> clean up old event logs once a snapshot file can replace them, which keeps 
> the overall size of event logs for a streaming query manageable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30860) Different behavior between rolling and non-rolling event log

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30860:
--
Parent: SPARK-28594
Issue Type: Sub-task  (was: Bug)

> Different behavior between rolling and non-rolling event log
> 
>
> Key: SPARK-30860
> URL: https://issues.apache.org/jira/browse/SPARK-30860
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Adam Binford
>Priority: Major
>
> When creating a rolling event log, the application directory is created with 
> a call to FileSystem.mkdirs, with the file permission 770. The default 
> behavior of HDFS is to set the permission of a file created with 
> FileSystem.create or FileSystem.mkdirs to (P & ^umask), where P is the 
> permission in the API call and umask is a system value set by 
> fs.permissions.umask-mode and defaults to 0022. This means, with default 
> settings, any mkdirs call can have at most 755 permissions, which causes 
> rolling event log directories to be created with 750 permissions. This causes 
> the history server to be unable to prune old applications if they are not run 
> by the same user running the history server.
> This is not a problem for non-rolling logs, because it uses 
> SparkHadoopUtils.createFile for Hadoop 2 backward compatibility, and then 
> calls FileSystem.setPermission with 770 after the file has been created. 
> setPermission doesn't have the umask applied to it, so this works fine.
> Obviously this could be fixed by changing fs.permissions.umask-mode, but I'm 
> not sure why that's set in the first place or whether changing it would hurt 
> anything else. The main issue is that rolling and non-rolling event logs 
> behave differently, and that behavior should probably be made consistent in 
> this repo.
>  
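> A minimal sketch of the two code paths being compared, using plain Hadoop 
> FileSystem calls against a hypothetical history directory (not Spark's exact 
> code):
> {code:scala}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.fs.permission.FsPermission
>
> val fs   = FileSystem.get(new Configuration())
> val dir  = new Path("/spark-history/app-1234")  // hypothetical application dir
> val perm = new FsPermission(Integer.parseInt("770", 8).toShort)
>
> // Rolling path: mkdirs is subject to fs.permissions.umask-mode (default 022),
> // so asking for 770 can end up as 750 on HDFS.
> fs.mkdirs(dir, perm)
>
> // Non-rolling path: an explicit setPermission afterwards is applied verbatim,
> // with no umask, which is why those logs keep 770.
> fs.setPermission(dir, perm)
> {code}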



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056331#comment-17056331
 ] 

Gengliang Wang commented on SPARK-31098:


[~dongjoon] Thanks for the explanation! I am closing this issue for now.

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Resolved] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31098.

Resolution: Later

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> 

[jira] [Updated] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31098:
--
Issue Type: Bug  (was: Improvement)

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> 

[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056326#comment-17056326
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

Please note that even the fixed behavior is not always desirable, because the 
schema is effectively chosen at random (it is taken from the smallest ORC file).

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056324#comment-17056324
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

It works only in the above cases (I also checked yesterday) but broke the 
other test cases. If you want, you can make a PR with a complete patch. 
However, in general, we may end up upgrading the ORC dependency in `branch-2.4`.
I'd like to hold off on the ORC dependency upgrade in `branch-2.4` because ORC 
changes a lot even within `1.5.x` (I sent a relevant email to the dev mailing 
list about this before).
I prefer to revisit backporting after we release `Apache Spark 3.0.0` with the 
new ORC versions and they prove stable in many environments.

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   

[jira] [Updated] (SPARK-31110) refine sql doc for SELECT

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31110:
--
Issue Type: Documentation  (was: Improvement)

> refine sql doc for SELECT
> -
>
> Key: SPARK-31110
> URL: https://issues.apache.org/jira/browse/SPARK-31110
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31110) refine sql doc for SELECT

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31110:
--
Component/s: Documentation

> refine sql doc for SELECT
> -
>
> Key: SPARK-31110
> URL: https://issues.apache.org/jira/browse/SPARK-31110
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056286#comment-17056286
 ] 

Gengliang Wang commented on SPARK-31098:


[~dongjoon] Thank you so much for looking into it.
I tried porting the changes from 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169
and it works.
Why do you think it is risky?

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> 

[jira] [Updated] (SPARK-30510) Publicly document options under spark.sql.*

2020-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30510:

Labels: release-notes  (was: )

> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none 
> of the options under {{spark.sql.*}} that are intended for users are 
> documented on spark.apache.org/docs.
> We should add a new documentation page for these options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31037) refine AQE config names

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056227#comment-17056227
 ] 

Dongjoon Hyun commented on SPARK-31037:
---

This is resolved via https://github.com/apache/spark/pull/27793 .

> refine AQE config names
> ---
>
> Key: SPARK-31037
> URL: https://issues.apache.org/jira/browse/SPARK-31037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31037) refine AQE config names

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31037.
---
Fix Version/s: 3.0.0
 Assignee: Wenchen Fan
   Resolution: Fixed

> refine AQE config names
> ---
>
> Key: SPARK-31037
> URL: https://issues.apache.org/jira/browse/SPARK-31037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31111) Fix interval output issue in ExtractBenchmark

2020-03-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-3:


 Summary: Fix interval output issue in ExtractBenchmark 
 Key: SPARK-3
 URL: https://issues.apache.org/jira/browse/SPARK-3
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31110) refine sql doc for SELECT

2020-03-10 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31110:
---

 Summary: refine sql doc for SELECT
 Key: SPARK-31110
 URL: https://issues.apache.org/jira/browse/SPARK-31110
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30189) Interval from year-month/date-time string handling whitespaces

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30189.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26815
[https://github.com/apache/spark/pull/26815]

> Interval from year-month/date-time string handling whitespaces
> --
>
> Key: SPARK-30189
> URL: https://issues.apache.org/jira/browse/SPARK-30189
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> # for pg feature parity
>  # for consistency with other types and other interval parser
>  
> {code:sql}
> postgres=# select interval E'2-2\t' year to month;
> interval
> 
>  2 years 2 mons
> (1 row)
> postgres=# select interval E'2-2\t' year to month;
> interval
> 
>  2 years 2 mons
> (1 row)
> postgres=# select interval E'2-\t2\t' year to month;
> ERROR:  invalid input syntax for type interval: "2-   2   "
> LINE 1: select interval E'2-\t2\t' year to month;
> ^
> postgres=# select interval '2  00:00:01' day to second;
> interval
> -
>  2 days 00:00:01
> (1 row)
> postgres=# select interval '- 2  00:00:01' day to second;
>  interval
> ---
>  -2 days +00:00:01
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30189) Interval from year-month/date-time string handling whitespaces

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30189:
---

Assignee: Kent Yao

> Interval from year-month/date-time string handling whitespaces
> --
>
> Key: SPARK-30189
> URL: https://issues.apache.org/jira/browse/SPARK-30189
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> # for pg feature parity
>  # for consistency with other types and other interval parser
>  
> {code:sql}
> postgres=# select interval E'2-2\t' year to month;
> interval
> 
>  2 years 2 mons
> (1 row)
> postgres=# select interval E'2-2\t' year to month;
> interval
> 
>  2 years 2 mons
> (1 row)
> postgres=# select interval E'2-\t2\t' year to month;
> ERROR:  invalid input syntax for type interval: "2-   2   "
> LINE 1: select interval E'2-\t2\t' year to month;
> ^
> postgres=# select interval '2  00:00:01' day to second;
> interval
> -
>  2 days 00:00:01
> (1 row)
> postgres=# select interval '- 2  00:00:01' day to second;
>  interval
> ---
>  -2 days +00:00:01
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31078) outputOrdering should handle aliases correctly

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31078.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27842
[https://github.com/apache/spark/pull/27842]

> outputOrdering should handle aliases correctly
> --
>
> Key: SPARK-31078
> URL: https://issues.apache.org/jira/browse/SPARK-31078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, `outputOrdering` doesn't respect aliases. Thus, the following code:
> {code:java}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
>   val df = (0 until 20).toDF("i").as("df")
>   df.repartition(8, df("i")).write.format("parquet")
> .bucketBy(8, "i").sortBy("i").saveAsTable("t")
>   val t1 = spark.table("t")
>   val t2 = t1.selectExpr("i as ii")
>   t1.join(t2, t1("i") === t2("ii")).explain
> }
> {code}
> would produce an unnecessary sort node:
> {code:java}
> == Physical Plan ==
> *(3) SortMergeJoin [i#8], [ii#10], Inner
> :- *(1) Project [i#8]
> :  +- *(1) Filter isnotnull(i#8)
> : +- *(1) ColumnarToRow
> :+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
> [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: 
> struct, SelectedBucketsCount: 8 out of 8
> +- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0
>+- *(2) Project [i#8 AS ii#10]
>   +- *(2) Filter isnotnull(i#8)
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
> [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: 
> struct, SelectedBucketsCount: 8 out of 8
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31078) outputOrdering should handle aliases correctly

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31078:
---

Assignee: Terry Kim

> outputOrdering should handle aliases correctly
> --
>
> Key: SPARK-31078
> URL: https://issues.apache.org/jira/browse/SPARK-31078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Currently, `outputOrdering` doesn't respect aliases. Thus, the following code:
> {code:java}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
>   val df = (0 until 20).toDF("i").as("df")
>   df.repartition(8, df("i")).write.format("parquet")
> .bucketBy(8, "i").sortBy("i").saveAsTable("t")
>   val t1 = spark.table("t")
>   val t2 = t1.selectExpr("i as ii")
>   t1.join(t2, t1("i") === t2("ii")).explain
> }
> {code}
> would produce an unnecessary sort node:
> {code:java}
> == Physical Plan ==
> *(3) SortMergeJoin [i#8], [ii#10], Inner
> :- *(1) Project [i#8]
> :  +- *(1) Filter isnotnull(i#8)
> : +- *(1) ColumnarToRow
> :+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
> [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: 
> struct, SelectedBucketsCount: 8 out of 8
> +- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0
>+- *(2) Project [i#8 AS ii#10]
>   +- *(2) Filter isnotnull(i#8)
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
> [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: 
> struct, SelectedBucketsCount: 8 out of 8
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31079:
---

Assignee: Xin Wu

> Add RuleExecutor metrics in Explain Formatted
> -
>
> Key: SPARK-31079
> URL: https://issues.apache.org/jira/browse/SPARK-31079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xin Wu
>Assignee: Xin Wu
>Priority: Major
>
> RuleExecutor already supports metering for the analyzer/optimizer. Providing 
> this information in the Explain command gives users a better experience when 
> debugging a specific query.
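
As a rough illustration only (the temp view and query below are made up, and the exact layout of the new rule metrics is not reproduced here), the formatted explain output that these metrics attach to can be produced from PySpark like this:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a tiny temp view just to have something to explain.
spark.range(10).createOrReplaceTempView("t")

# EXPLAIN FORMATTED is the command this ticket extends with analyzer/optimizer
# rule metrics; they would appear alongside the formatted plan.
spark.sql("EXPLAIN FORMATTED SELECT id FROM t WHERE id > 5").show(truncate=False)
{code}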



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31079.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27846
[https://github.com/apache/spark/pull/27846]

> Add RuleExecutor metrics in Explain Formatted
> -
>
> Key: SPARK-31079
> URL: https://issues.apache.org/jira/browse/SPARK-31079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xin Wu
>Assignee: Xin Wu
>Priority: Major
> Fix For: 3.0.0
>
>
> RuleExecutor already supports metering for the analyzer/optimizer. Providing 
> this information in the Explain command gives users a better experience when 
> debugging a specific query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31109) Add version information to the configuration of Mesos

2020-03-10 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31109:
--

 Summary: Add version information to the configuration of Mesos
 Key: SPARK-31109
 URL: https://issues.apache.org/jira/browse/SPARK-31109
 Project: Spark
  Issue Type: Sub-task
  Components: Mesos
Affects Versions: 3.1.0
Reporter: jiaan.geng


resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31108) Parameter cannot be passed to pandas udf of type map_iter

2020-03-10 Thread xge (Jira)
xge created SPARK-31108:
---

 Summary: Parameter cannot be passed to pandas udf of type map_iter
 Key: SPARK-31108
 URL: https://issues.apache.org/jira/browse/SPARK-31108
 Project: Spark
  Issue Type: Question
  Components: Examples
Affects Versions: 3.0.0
Reporter: xge


Parameters can only be passed in the following way:



from pyspark.sql.functions import pandas_udf, PandasUDFType

def map_iter_pandas_udf_example(spark):
    strr = "abcd
    df = spark.createDataFrame([(1, 21),(2,30)],("id", "age")) 

    @pandas_udf(df.schema, PandasUDFType.MAP_ITER)
    def filter_func(batch_iter, x = strr):
        print(x)
        for pdf in batch_iter:
            yield pdf[pdf.id == 1]

    df.mapInPandas(filter_func).show()

***

 

However, if the code is edited as follows, an error occurs:

***

from pyspark.sql.functions import pandas_udf, PandasUDFType

def map_iter_pandas_udf_example(spark):
    strr = "abcd
    df = spark.createDataFrame([(1, 21),(2,30)],("id", "age")) 

    @pandas_udf(df.schema, PandasUDFType.MAP_ITER)
    def filter_func(batch_iter, x = strr):
        print(x)
        for pdf in batch_iter:
            yield pdf[pdf.id == 1]

    data = "dbca"

    df.mapInPandas(filter_func(data)).show()

***

ValueError: Invalid udf: the udf argument must be a pandas_udf of type MAP_ITER.

Does anyone know whether a pandas udf of type MAP_ITER can accept extra 
parameters, and if so, how to write the code? Thanks.
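
In case it helps, here is a minimal sketch of one possible workaround (just a closure that captures the extra value, so that mapInPandas still receives a one-argument MAP_ITER udf; the names are taken from the example above):

{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

def map_iter_pandas_udf_example(spark):
    df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
    data = "dbca"  # the extra value we want the udf to see

    def make_filter_func(x):
        # x is captured by the closure, so the udf itself still takes only the
        # batch iterator and remains a valid pandas_udf of type MAP_ITER.
        @pandas_udf(df.schema, PandasUDFType.MAP_ITER)
        def filter_func(batch_iter):
            print(x)
            for pdf in batch_iter:
                yield pdf[pdf.id == 1]
        return filter_func

    df.mapInPandas(make_filter_func(data)).show()
{code}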

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055786#comment-17055786
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

Hmm. [~Gengliang.Wang]. Unfortunately, this seems to be risky in `branch-2.4`. 
Shall we close this because SPARK-27034 supersedes this already in 3.0?

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   

[jira] [Created] (SPARK-31107) Extend FairScheduler to support pool level resource isolation

2020-03-10 Thread liupengcheng (Jira)
liupengcheng created SPARK-31107:


 Summary: Extend FairScheduler to support pool level resource 
isolation
 Key: SPARK-31107
 URL: https://issues.apache.org/jira/browse/SPARK-31107
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liupengcheng


Currently, Spark only provides two types of scheduler: FIFO and FAIR, but in SQL 
high-concurrency scenarios a few drawbacks are exposed.

FIFO: it can easily cause congestion when a large SQL query occupies all the 
resources.

FAIR: the taskSets of one pool may occupy all the resources because there is no 
hard limit on the maximum usage for each pool; this case is frequently met under 
high workloads.

So we propose to add a maxShare argument to FairScheduler to control the maximum 
number of running tasks for each pool.

One thing that needs attention is handling this well so that the 
`ExecutorAllocationManager` can still release resources:
 e.g. suppose we have 100 executors; if tasks are scheduled on all executors 
with a max concurrency of 50, there are cases where the executors never become 
idle and cannot be released.

One idea is to bind executors to each pool, and then only schedule a pool's 
tasks on the executors that belong to it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31106) Support IS_JSON

2020-03-10 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055757#comment-17055757
 ] 

Rakesh Raushan commented on SPARK-31106:


I am working on it.

> Support IS_JSON
> ---
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Currently, null is returned when we come across invalid JSON. We should either 
> throw an exception for invalid JSON or return false, as other DBMSs do. In a 
> function like `json_array_length` we need to return NULL for a null array, so 
> returning NULL for invalid input as well might confuse users.
>  
> DBMSs supporting this function are:
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31106) Support IS_JSON

2020-03-10 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31106:
--

 Summary: Support IS_JSON
 Key: SPARK-31106
 URL: https://issues.apache.org/jira/browse/SPARK-31106
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Currently, null is returned when we come across invalid JSON. We should either 
throw an exception for invalid JSON or return false, as other DBMSs do. In a 
function like `json_array_length` we need to return NULL for a null array, so 
returning NULL for invalid input as well might confuse users. A small 
illustration of the current behavior follows the list below.

 

DBMSs supporting this function are:
 * MySQL
 * SQL Server
 * Sqlite
 * MariaDB
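
As a small illustration of the current behavior mentioned above (using the existing {{get_json_object}} function; the input string is made up):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Today an invalid JSON string silently yields NULL instead of an error or false.
spark.sql("SELECT get_json_object('{not valid json', '$.a') AS result").show()
# expected output: a single row whose `result` column is null
{code}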



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31105) Respect sql execution id when scheduling taskSets

2020-03-10 Thread liupengcheng (Jira)
liupengcheng created SPARK-31105:


 Summary: Respect sql execution id when scheduling taskSets
 Key: SPARK-31105
 URL: https://issues.apache.org/jira/browse/SPARK-31105
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liupengcheng


Currently, Spark sorts taskSets by jobId and stageId and then schedules them in 
order for the FIFO schedulingMode. In OLAP scenarios, especially under high 
concurrency, the taskSets always come from different SQL queries, and with 
adaptive execution several jobs can be submitted for execution at one time for a 
single query. But we currently order those taskSets without considering the 
execution group, which may cause a query to be delayed.
So I propose considering the SQL execution id when scheduling jobs.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055742#comment-17055742
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

I'll make a small bug fix for this use case to at least prevent the exception. 

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Comment Edited] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055731#comment-17055731
 ] 

Dongjoon Hyun edited comment on SPARK-31098 at 3/10/20, 9:26 AM:
-

-Hmm. It seems that there is more patches for this in addition to that. Let me 
dig more.-

SPARK-27034 is correct. You need the following especially to backport what you 
want.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169


was (Author: dongjoon):
~Hmm. It seems that there is more patches for this in addition to that. Let me 
dig more.~

SPARK-27034 is correct. You need the following especially to backport what you 
want.
- 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost 

[jira] [Comment Edited] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055731#comment-17055731
 ] 

Dongjoon Hyun edited comment on SPARK-31098 at 3/10/20, 9:26 AM:
-

~Hmm. It seems that there is more patches for this in addition to that. Let me 
dig more.~

SPARK-27034 is correct. You need the following especially to backport what you 
want.
- 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169


was (Author: dongjoon):
Hmm. It seems that there is more patches for this in addition to that. Let me 
dig more.

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> 

[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055731#comment-17055731
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

Hmm. It seems that there is more patches for this in addition to that. Let me 
dig more.

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Created] (SPARK-31104) Add documentation for all the Json Functions

2020-03-10 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31104:
--

 Summary: Add documentation for all the Json Functions
 Key: SPARK-31104
 URL: https://issues.apache.org/jira/browse/SPARK-31104
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055724#comment-17055724
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

And, SPARK-27034 is the fix for this case in 3.0. As you see, this is `struct`.

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Created] (SPARK-31103) Extend Support for useful JSON Functions

2020-03-10 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31103:
--

 Summary: Extend Support for useful JSON Functions
 Key: SPARK-31103
 URL: https://issues.apache.org/jira/browse/SPARK-31103
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Currently, Spark only supports a few functions for JSON. There are many other 
common utility functions that are supported by other popular DBMSs. Supporting 
these functions will make things easier for prospective users. Also, some 
functions like `json_array_length` and `json_object_keys` are more intuitive and 
would make a new user's life much simpler.

I have added some JSON functions that I am working on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30992) Arrange scattered config of streaming module

2020-03-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30992:


Assignee: jiaan.geng

> Arrange scattered config of streaming module
> 
>
> Key: SPARK-30992
> URL: https://issues.apache.org/jira/browse/SPARK-30992
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> I found a lot of scattered configs in the Streaming module.
> I think we should arrange these configs in a unified place.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30992) Arrange scattered config of streaming module

2020-03-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30992.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27744
[https://github.com/apache/spark/pull/27744]

> Arrange scattered config of streaming module
> 
>
> Key: SPARK-30992
> URL: https://issues.apache.org/jira/browse/SPARK-30992
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> I found a lot of scattered configs in the Streaming module.
> I think we should arrange these configs in a unified place.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055712#comment-17055712
 ] 

Gengliang Wang commented on SPARK-31098:


[~dongjoon] Yes, one of the files is missing the column `a5`.
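
For reference, a self-contained sketch of the same shape of problem (the directory, field names, and values below are made up and are not the attached files; whether it reproduces can also depend on which file the schema is inferred from):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two ORC files under one directory, one of them missing the last struct field.
spark.sql("SELECT named_struct('a1', 1, 'a2', 2, 'a3', 3, 'a4', 4, 'a5', 5) AS s") \
    .write.mode("overwrite").orc("/tmp/orc_mixed")
spark.sql("SELECT named_struct('a1', 1, 'a2', 2, 'a3', 3, 'a4', 4) AS s") \
    .write.mode("append").orc("/tmp/orc_mixed")

# If the inferred read schema carries all five fields, the 2.4 reader can hit
# ArrayIndexOutOfBoundsException: 4 when it asks the shorter struct for a fifth
# field, while master no longer fails here after SPARK-27034 (see the
# OrcFileFormat.scala lines linked above).
spark.read.orc("/tmp/orc_mixed").show()
{code}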

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file which contains null value in nested field, 
> there is such exception:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Closed] (SPARK-30784) Hive 2.3 profile should still use orc-nohive

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30784.
-

> Hive 2.3 profile should still use orc-nohive
> 
>
> Key: SPARK-30784
> URL: https://issues.apache.org/jira/browse/SPARK-30784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yin Huai
>Priority: Critical
>
> Originally reported at 
> [https://github.com/apache/spark/pull/26619#issuecomment-583802901]
>  
> Right now, the Hive 2.3 profile pulls in regular orc, which depends on 
> hive-storage-api. However, hive-storage-api and hive-common have the 
> following common class files:
>  
> org/apache/hadoop/hive/common/ValidReadTxnList.class
>  org/apache/hadoop/hive/common/ValidTxnList.class
>  org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class
> For example, 
> [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java]
>  (pulled in by orc 1.5.8) and 
> [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java]
>  (from hive-common 2.3.6) both are in the classpath and they are different. 
> Having both versions in the classpath can cause unexpected behavior due to 
> classloading order. We should still use orc-nohive, which has 
> hive-storage-api shaded.
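
As a quick way to see which copy wins at runtime, one can check the code source of the duplicated class (a diagnostic sketch, not part of the issue itself):

{code:scala}
// Prints the jar that actually provides ValidReadTxnList on the classpath;
// with both hive-storage-api and hive-common present, classloading order
// decides which of the two different copies is used.
val clazz = Class.forName("org.apache.hadoop.hive.common.ValidReadTxnList")
println(clazz.getProtectionDomain.getCodeSource.getLocation)
{code}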



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()

2020-03-10 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-30443:
--
Affects Version/s: 3.0.0

> "Managed memory leak detected" even with no calls to take() or limit()
> --
>
> Key: SPARK-30443
> URL: https://issues.apache.org/jira/browse/SPARK-30443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.4, 3.0.0
>Reporter: Luke Richter
>Priority: Major
> Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, 
> even though we are not calling take() or limit().
> According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), 
> managed memory leaks should only be caused by not reading an iterator to 
> completion, e.g. via take() or limit().
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
> memory leak detected; size = 2097152 bytes, TID = 118"
>  The size of the managed memory leak is always 2MB.
> I have created a minimal test program that reproduces the warning: 
> {code:java}
> import pyspark.sql
> import pyspark.sql.functions as fx
> def main():
> builder = pyspark.sql.SparkSession.builder
> builder = builder.appName("spark-jira")
> spark = builder.getOrCreate()
> reader = spark.read
> reader = reader.format("csv")
> reader = reader.option("inferSchema", "true")
> reader = reader.option("header", "true")
> table_c = reader.load("c.csv")
> table_a = reader.load("a.csv")
> table_b = reader.load("b.csv")
> primary_filter = fx.col("some_code").isNull()
> new_primary_data = table_a.filter(primary_filter)
> new_ids = new_primary_data.select("some_id")
> new_data = table_b.join(new_ids, "some_id")
> new_data = new_data.select("some_id")
> result = table_c.join(new_data, "some_id", "left")
> result.repartition(1).write.json("results.json", mode="overwrite")
> spark.stop()
> if __name__ == "__main__":
> main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects and 
> joins.
> The input data is made up of 3 CSV files. The input data files are quite 
> large, roughly 2.6GB in total uncompressed. I attempted to reduce the number 
> of rows in the CSV input files but this caused the warning to no longer 
> appear. After compressing the files I was able to attach them below.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055696#comment-17055696
 ] 

Dongjoon Hyun edited comment on SPARK-31098 at 3/10/20, 8:37 AM:
-

I guess your expectation is the behavior of `mergeSchema`, isn't it?
One file is missing column `a5`.
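
For reference, a hedged sketch of that behavior (it relies on the ORC `mergeSchema` support added by SPARK-11412, i.e. Spark 3.0+; the path is a placeholder):

{code:scala}
// Unions the per-file struct fields instead of taking the schema of a single
// file, so a5 shows up even though one of the files does not contain it.
spark.read.option("mergeSchema", "true").orc("/tmp/files/").printSchema()
{code}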


was (Author: dongjoon):
I guess your expectation is the behavior of `mergeSchema`, isn't it?

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file, which contains a null value in a nested field, 
> the following exception is thrown:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> 

[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055696#comment-17055696
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

I guess your expectation is the behavior of `mergeSchema`, isn't it?

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file, which contains a null value in a nested field, 
> the following exception is thrown:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055693#comment-17055693
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

Hi, [~Gengliang.Wang]. 

First of all, even in 3.0.0, the result schema depends on the order of the files 
(technically, on the size of the files, because Spark sorts them).
{code}
scala> spark.read.orc("/tmp/orc/2019-01-02", "/tmp/orc/2019-01-01").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a1: integer (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: timestamp (nullable = true)
 |    |-- a4: string (nullable = true)
 |    |-- a5: integer (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- b1: integer (nullable = true)
 |    |-- b2: string (nullable = true)


scala> spark.read.orc("/tmp/orc/2019-01-01", "/tmp/orc/2019-01-02").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a1: integer (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: timestamp (nullable = true)
 |    |-- a4: string (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- b1: integer (nullable = true)
 |    |-- b2: string (nullable = true)

scala> spark.version
res11: String = 3.0.0-preview2
{code}

So, to be consistent, `mergeSchema` is the only solution.
{code}
scala> spark.read.option("mergeSchema", "true").orc("/tmp/orc/2019-01-01", 
"/tmp/orc/2019-01-02").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a1: integer (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: timestamp (nullable = true)
 |    |-- a4: string (nullable = true)
 |    |-- a5: integer (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- b1: integer (nullable = true)
 |    |-- b2: string (nullable = true)
{code}

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file, which contains a null value in a nested field, 
> the following exception is thrown:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> 

[jira] [Updated] (SPARK-11412) Support merge schema for ORC

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-11412:
--
Affects Version/s: 2.3.4
   2.4.5

> Support merge schema for ORC
> 
>
> Key: SPARK-11412
> URL: https://issues.apache.org/jira/browse/SPARK-11412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.0, 2.1.1, 2.2.0, 2.3.4, 2.4.5
>Reporter: Dave
>Assignee: EdisonWang
>Priority: Major
> Fix For: 3.0.0
>
>
> When I tried to load partitioned ORC files with a slight difference in a 
> nested column, say the column:
> |-- request: struct (nullable = true)
> |    |-- datetime: string (nullable = true)
> |    |-- host: string (nullable = true)
> |    |-- ip: string (nullable = true)
> |    |-- referer: string (nullable = true)
> |    |-- request_uri: string (nullable = true)
> |    |-- uri: string (nullable = true)
> |    |-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", 
> "true").load("/data/warehouse/") to load the data.
> But the schema doesn't show request.page_url_lists.
> I am wondering whether schema merge works for ORC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException

2020-03-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055672#comment-17055672
 ] 

Dongjoon Hyun commented on SPARK-31098:
---

Thank you for pinging me and for attaching the file. Let me take a look.

> Reading ORC files throws IndexOutOfBoundsException
> --
>
> Key: SPARK-31098
> URL: https://issues.apache.org/jira/browse/SPARK-31098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: files.tar
>
>
> On reading the attached ORC file, which contains a null value in a nested field, 
> the following exception is thrown:
> {code:java}
> scala> spark.read.orc("/tmp/files/").show()
> 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4
>   at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
>   at 
> 

[jira] [Resolved] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31065.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27854
[https://github.com/apache/spark/pull/27854]

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> 

[jira] [Assigned] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31065:
-

Assignee: Hyukjin Kwon

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
>   at 
> 

[jira] [Commented] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause

2020-03-10 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055642#comment-17055642
 ] 

angerszhu commented on SPARK-30707:
---

Added a PR: [https://github.com/apache/spark/pull/27861]

> Lead/Lag window function throws AnalysisException without ORDER BY clause
> -
>
> Key: SPARK-30707
> URL: https://issues.apache.org/jira/browse/SPARK-30707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
>  Lead/Lag window function throws AnalysisException without ORDER BY clause:
> {code:java}
> SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four
> FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s
> org.apache.spark.sql.AnalysisException
> Window function lead(ten#x, (four#x + 1), null) requires window to be 
> ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + 
> 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY 
> window_ordering) from table;
> {code}
>  
> Maybe we need to fix this issue.
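
For comparison, a sketch of a form that the analyzer accepts today (the table and column names come from the report; the non-constant offset is replaced by a literal 1 here, since this only illustrates the ORDER BY requirement):

{code:scala}
// Adding ORDER BY to the window specification satisfies the current check.
spark.sql("""
  SELECT lead(ten, 1) OVER (PARTITION BY four ORDER BY ten), ten, four
  FROM (SELECT * FROM tenk1 WHERE unique2 < 10) s
""").show()
{code}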



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31096) Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec`

2020-03-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31096.
-
Fix Version/s: 3.0.0
 Assignee: Wei Xue
   Resolution: Fixed

> Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec`
> ---
>
> Key: SPARK-31096
> URL: https://issues.apache.org/jira/browse/SPARK-31096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29183:
--
Summary: Upgrade JDK 11 Installation to 11.0.6  (was: Upgrade JDK 11 
Installation to 11.0.4)

> Upgrade JDK 11 Installation to 11.0.6
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression 
> fixes. We had better upgrade it to the latest 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29199) Add linters and license/dependency checkers to GitHub Action

2020-03-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29199.
---
Fix Version/s: 3.0.0
 Assignee: Dongjoon Hyun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25879

> Add linters and license/dependency checkers to GitHub Action
> 
>
> Key: SPARK-29199
> URL: https://issues.apache.org/jira/browse/SPARK-29199
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org