[jira] [Assigned] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28196:


Assignee: Apache Spark

> Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
> ---
>
> Key: SPARK-28196
> URL: https://issues.apache.org/jira/browse/SPARK-28196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28196:


Assignee: (was: Apache Spark)

> Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
> ---
>
> Key: SPARK-28196
> URL: https://issues.apache.org/jira/browse/SPARK-28196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog

2019-06-27 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28196:

Description: 
 
{code:scala}
def listTables(db: String, pattern: String, includeLocalTempViews: Boolean): Seq[TableIdentifier]
def listLocalTempViews(pattern: String): Seq[TableIdentifier]
{code}
This is needed because in some cases {{listTables}} should not include local 
temporary views, and in other cases we only need to list local temporary views.
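For illustration, a minimal sketch of how a caller could use the proposed overloads (the two signatures are from the proposal above; the helper itself is an assumption added purely for illustration):
{code:scala}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Hedged sketch: exercising the proposed SessionCatalog overloads.
// Only the two signatures above come from the proposal; everything else is illustrative.
def printCatalogContents(catalog: SessionCatalog): Unit = {
  // Tables in `default`, excluding local temporary views.
  val withoutLocalTempViews: Seq[TableIdentifier] =
    catalog.listTables("default", "*", includeLocalTempViews = false)

  // Only the local temporary views matching the pattern.
  val localTempViews: Seq[TableIdentifier] =
    catalog.listLocalTempViews("*")

  println(s"tables: ${withoutLocalTempViews.mkString(", ")}")
  println(s"local temp views: ${localTempViews.mkString(", ")}")
}
{code}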

> Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
> ---
>
> Key: SPARK-28196
> URL: https://issues.apache.org/jira/browse/SPARK-28196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
>  
> {code:scala}
> def listTables(db: String, pattern: String, includeLocalTempViews: Boolean): Seq[TableIdentifier]
> def listLocalTempViews(pattern: String): Seq[TableIdentifier]
> {code}
> This is needed because in some cases {{listTables}} should not include local 
> temporary views, and in other cases we only need to list local temporary views.






[jira] [Updated] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog

2019-06-27 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28196:

Summary: Add a new `listTables` and `listLocalTempViews` APIs for 
SessionCatalog  (was: SessionCatalog#listTables support does not list local 
temporary views)

> Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
> ---
>
> Key: SPARK-28196
> URL: https://issues.apache.org/jira/browse/SPARK-28196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-28196) SessionCatalog#listTables support does not list local temporary views

2019-06-27 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28196:
---

 Summary: SessionCatalog#listTables support does not list local 
temporary views
 Key: SPARK-28196
 URL: https://issues.apache.org/jira/browse/SPARK-28196
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Assigned] (SPARK-28133) Hyperbolic Functions

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28133:


Assignee: Apache Spark

> Hyperbolic Functions
> 
>
> Key: SPARK-28133
> URL: https://issues.apache.org/jira/browse/SPARK-28133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> ||Function||Description||Example||Result||
> |{{sinh(_x_)}}|hyperbolic sine|{{sinh(0)}}|{{0}}|
> |{{cosh(_x_)}}|hyperbolic cosine|{{cosh(0)}}|{{1}}|
> |{{tanh(_x_)}}|hyperbolic tangent|{{tanh(0)}}|{{0}}|
> |{{asinh(_x_)}}|inverse hyperbolic sine|{{asinh(0)}}|{{0}}|
> |{{acosh(_x_)}}|inverse hyperbolic cosine|{{acosh(1)}}|{{0}}|
> |{{atanh(_x_)}}|inverse hyperbolic tangent|{{atanh(0)}}|{{0}}|
>  
>  
> [https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-HYP-TABLE]
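If these are exposed as built-in SQL functions (the existing {{sinh}}/{{cosh}}/{{tanh}} already are; treating {{asinh}}/{{acosh}}/{{atanh}} as available is the assumption of this sketch), usage could look like:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch: checking the expected results from the table above via Spark SQL.
val spark = SparkSession.builder().appName("hyperbolic-check").master("local[*]").getOrCreate()

spark.sql(
  "SELECT sinh(0), cosh(0), tanh(0), asinh(0), acosh(1), atanh(0)"
).show()
// Expected, per the table above: 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0

spark.stop()
{code}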






[jira] [Assigned] (SPARK-28133) Hyperbolic Functions

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28133:


Assignee: (was: Apache Spark)

> Hyperbolic Functions
> 
>
> Key: SPARK-28133
> URL: https://issues.apache.org/jira/browse/SPARK-28133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Description||Example||Result||
> |{{sinh(_x_)}}|hyperbolic sine|{{sinh(0)}}|{{0}}|
> |{{cosh(_x_)}}|hyperbolic cosine|{{cosh(0)}}|{{1}}|
> |{{tanh(_x_)}}|hyperbolic tangent|{{tanh(0)}}|{{0}}|
> |{{asinh(_x_)}}|inverse hyperbolic sine|{{asinh(0)}}|{{0}}|
> |{{acosh(_x_)}}|inverse hyperbolic cosine|{{acosh(1)}}|{{0}}|
> |{{atanh(_x_)}}|inverse hyperbolic tangent|{{atanh(0)}}|{{0}}|
>  
>  
> [https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-HYP-TABLE]






[jira] [Updated] (SPARK-17398) Failed to query on external JSon Partitioned table

2019-06-27 Thread bianqi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bianqi updated SPARK-17398:
---
Attachment: screenshot-1.png

> Failed to query on external JSon Partitioned table
> --
>
> Key: SPARK-17398
> URL: https://issues.apache.org/jira/browse/SPARK-17398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>Priority: Major
> Fix For: 2.0.1
>
> Attachments: screenshot-1.png
>
>
> 1. Create an external JSON partitioned table 
> with the SerDe in hive-hcatalog-core-1.2.1.jar, downloaded from
> https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1
> 2. Querying the table throws an exception, although it works in Spark 1.5.2:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: 
> Lost task
>  0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: 
> java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord
> at 
> org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>  
> 3. Test Code
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.hive.HiveContext
> object JsonBugs {
>   def main(args: Array[String]): Unit = {
> val table = "test_json"
> val location = "file:///g:/home/test/json"
> val create = s"""CREATE   EXTERNAL  TABLE  ${table}
>  (id string,  seq string )
>   PARTITIONED BY(index int)
>   ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
>   LOCATION "${location}" 
>   """
> val add_part = s"""
>  ALTER TABLE ${table} ADD 
>  PARTITION (index=1)LOCATION '${location}/index=1'
> """
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
> val ctx = new SparkContext(conf)
> val hctx = new HiveContext(ctx)
> val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table)
> if (!exist) {
>   hctx.sql(create)
>   hctx.sql(add_part)
> } else {
>   hctx.sql("show partitions " + table).show()
> }
> hctx.sql("select * from test_json").show()
>   }
> }






[jira] [Commented] (SPARK-17398) Failed to query on external JSon Partitioned table

2019-06-27 Thread bianqi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874652#comment-16874652
 ] 

bianqi commented on SPARK-17398:


 !screenshot-1.png!  

> Failed to query on external JSon Partitioned table
> --
>
> Key: SPARK-17398
> URL: https://issues.apache.org/jira/browse/SPARK-17398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>Priority: Major
> Fix For: 2.0.1
>
> Attachments: screenshot-1.png
>
>
> 1. Create an external JSON partitioned table 
> with the SerDe in hive-hcatalog-core-1.2.1.jar, downloaded from
> https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1
> 2. Querying the table throws an exception, although it works in Spark 1.5.2:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: 
> Lost task
>  0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: 
> java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord
> at 
> org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>  
> 3. Test Code
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.hive.HiveContext
> object JsonBugs {
>   def main(args: Array[String]): Unit = {
> val table = "test_json"
> val location = "file:///g:/home/test/json"
> val create = s"""CREATE   EXTERNAL  TABLE  ${table}
>  (id string,  seq string )
>   PARTITIONED BY(index int)
>   ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
>   LOCATION "${location}" 
>   """
> val add_part = s"""
>  ALTER TABLE ${table} ADD 
>  PARTITION (index=1)LOCATION '${location}/index=1'
> """
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
> val ctx = new SparkContext(conf)
> val hctx = new HiveContext(ctx)
> val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table)
> if (!exist) {
>   hctx.sql(create)
>   hctx.sql(add_part)
> } else {
>   hctx.sql("show partitions " + table).show()
> }
> hctx.sql("select * from test_json").show()
>   }
> }






[jira] [Assigned] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28194:


Assignee: Apache Spark

> [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
> --
>
> Key: SPARK-28194
> URL: https://issues.apache.org/jira/browse/SPARK-28194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: feiwang
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
> {code}






[jira] [Assigned] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28194:


Assignee: (was: Apache Spark)

> [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
> --
>
> Key: SPARK-28194
> URL: https://issues.apache.org/jira/browse/SPARK-28194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
> {code}






[jira] [Updated] (SPARK-28195) CheckAnalysis not working for Command and report misleading error message

2019-06-27 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-28195:
-
Description: 
Currently, we encountered an issue when executing 
`InsertIntoDataSourceDirCommand`: its query relied on a non-existent table or 
view, but we got a misleading error message:
{code:java}
Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid 
call to dataType on unresolved object, tree: 'kr.objective_id
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544)
at 
org.apache.spark.sql.execution.command.InsertIntoDataSourceDirCommand.run(InsertIntoDataSourceDirCommand.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at 
org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:246)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3277)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3276)
at org.apache.spark.sql.Dataset.init(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:277)
... 11 more

{code}
After looking into the code, I found that this is because we have supported the 
`runSQLOnFiles` feature since 2.3: if the table does not exist and it is not a 
temporary table, the relation is treated as a query running directly on files.

The `ResolveSQLOnFile` rule will analyze it and return an `UnresolvedRelation` on 
resolution failure (it is not actually SQL on files, so resolution fails). Because 
a Command has no children, `CheckAnalysis` skips checking the `UnresolvedRelation`, 
and we end up with the misleading error message above when executing the command.

I think maybe we should run checkAnalysis on a command's query plan. Or is there 
any reason for not checking analysis for commands?

This issue still seems to exist in the master branch. 
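One possible direction (just a sketch of the idea, not a concrete patch) is to walk a command's inner query plan and fail fast on any unresolved relation, so the user gets a "table or view not found" style error instead of the `UnresolvedException` above:
{code:scala}
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hedged sketch: explicitly check a command's inner query plan before execution.
// Where such a hook should live (CheckAnalysis itself, or the command's run()) is
// exactly the open question of this ticket.
def assertInnerQueryResolved(query: LogicalPlan): Unit = {
  query.foreach {
    case u: UnresolvedRelation =>
      // A real patch would raise AnalysisException; a plain exception keeps the sketch standalone.
      throw new IllegalStateException(s"Table or view not found: $u")
    case _ => // node resolved, nothing to check
  }
}
{code}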

  was:
Currently, we encountered an issue when executing 
`InsertIntoDataSourceDirCommand`: its query relied on a non-existent table or 
view, but we got a misleading error message:
{code:java}
Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid 
call to dataType on unresolved object, tree: 'kr.objective_id
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544)
at 

[jira] [Created] (SPARK-28195) CheckAnalysis not working for Command and report misleading error message

2019-06-27 Thread liupengcheng (JIRA)
liupengcheng created SPARK-28195:


 Summary: CheckAnalysis not working for Command and report 
misleading error message
 Key: SPARK-28195
 URL: https://issues.apache.org/jira/browse/SPARK-28195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: liupengcheng


Currently, we encountered an issue when executing 
`InsertIntoDataSourceDirCommand`: its query relied on a non-existent table or 
view, but we got a misleading error message:
{code:java}
Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid 
call to dataType on unresolved object, tree: 'kr.objective_id
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544)
at 
org.apache.spark.sql.execution.command.InsertIntoDataSourceDirCommand.run(InsertIntoDataSourceDirCommand.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at 
org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:246)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3277)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3276)
at org.apache.spark.sql.Dataset.init(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:277)
... 11 more

{code}
After looking into the code, I found that this is because we have supported the 
`runSQLOnFiles` feature since 2.3: if the table does not exist and it is not a 
temporary table, the relation is treated as a query running directly on files.

The `ResolveSQLOnFile` rule will analyze it and return an `UnresolvedRelation` on 
resolution failure (it is not actually SQL on files, so resolution fails). Because 
a Command has no children, `CheckAnalysis` skips checking the `UnresolvedRelation`, 
and we end up with the misleading error message above when executing the command.

I think maybe we should run checkAnalysis on a command's query plan. Or is there 
any reason for not checking analysis for commands?






[jira] [Updated] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement

2019-06-27 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-28194:

Description: 

{code:java}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
at 
org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
{code}
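For context, the exception class itself is the ordinary Scala {{None.get}} failure; the minimal sketch below only illustrates that failure mode and the usual defensive alternatives, not the actual EnsureRequirements reorder logic:
{code:scala}
// Hedged, generic sketch of the failure mode: calling .get on a None.
val missing: Option[Int] = None

// missing.get would throw java.util.NoSuchElementException: None.get,
// which is the exception surfacing in the stack trace above.

// Common defensive alternatives:
val withDefault: Int = missing.getOrElse(-1)
val described: String = missing match {
  case Some(v) => s"found $v"
  case None    => "expression not found in the expected keys"
}
{code}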


> [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
> --
>
> Key: SPARK-28194
> URL: https://issues.apache.org/jira/browse/SPARK-28194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> 

[jira] [Updated] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement

2019-06-27 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-28194:

Summary: [SQL] A NoSuchElementException maybe thrown when EnsureRequirement 
 (was: a NoSuchElementException maybe thrown when EnsureRequirement)

> [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
> --
>
> Key: SPARK-28194
> URL: https://issues.apache.org/jira/browse/SPARK-28194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>







[jira] [Created] (SPARK-28194) a NoSuchElementException maybe thrown when EnsureRequirement

2019-06-27 Thread feiwang (JIRA)
feiwang created SPARK-28194:
---

 Summary: a NoSuchElementException maybe thrown when 
EnsureRequirement
 Key: SPARK-28194
 URL: https://issues.apache.org/jira/browse/SPARK-28194
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: feiwang









[jira] [Assigned] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase

2019-06-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28179:


Assignee: Yuming Wang

> Avoid hard-coded config: spark.sql.globalTempDatabase
> -
>
> Key: SPARK-28179
> URL: https://issues.apache.org/jira/browse/SPARK-28179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>







[jira] [Resolved] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase

2019-06-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28179.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24979
[https://github.com/apache/spark/pull/24979]

> Avoid hard-coded config: spark.sql.globalTempDatabase
> -
>
> Key: SPARK-28179
> URL: https://issues.apache.org/jira/browse/SPARK-28179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-28193) toPandas() not working as expected in Apache Spark 2.4.0

2019-06-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874597#comment-16874597
 ] 

Hyukjin Kwon commented on SPARK-28193:
--

Seems like the Arrow optimization was enabled. Can you try it after setting 
{{spark.sql.execution.arrow.fallback.enabled}} to false and share the error 
message?

> toPandas() not working as expected in Apache Spark 2.4.0
> 
>
> Key: SPARK-28193
> URL: https://issues.apache.org/jira/browse/SPARK-28193
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.2
> Environment: Databricks 5.3 Apache Spark 2.4.0
>Reporter: SUSHMIT ROY
>Priority: Minor
>
> I am in a Databricks environment using PySpark 2.4.0, but toPandas() is still 
> taking a lot of time. Any ideas on what could be causing the issue? 
> Also, I am getting a warning like {{UserWarning: pyarrow.open_stream is 
> deprecated, please use pyarrow.ipc.open_stream}}, although I have upgraded to 
> pyarrow 0.13.0.






[jira] [Assigned] (SPARK-28188) Materialize Dataframe API

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28188:


Assignee: Apache Spark

> Materialize Dataframe API 
> --
>
> Key: SPARK-28188
> URL: https://issues.apache.org/jira/browse/SPARK-28188
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Vinitha Reddy Gankidi
>Assignee: Apache Spark
>Priority: Major
>
> We have added a new API to materialize dataframes and our internal users have 
> found it very useful. For use cases where you need to do different 
> computations on the same dataframe, Spark recomputes the dataframe each time. 
> This is problematic if evaluation of the dataframe is expensive.
> Materialize is a Spark action. It is a way to let Spark explicitly know that 
> the dataframe has already been computed. Once a dataframe is materialized, 
> Spark skips all stages prior to the materialize when the dataframe is reused 
> later on.
> Spark may scan the same table twice if two queries load different columns. 
> For example, the following two queries would scan the same data twice:
> {code:java}
> val tab = spark.table("some_table").filter("c LIKE '%match%'")
> val num_groups = tab.agg(distinctCount($"a"))
> val groups_with_b = tab.groupBy($"a").agg(min($"b") as "min"){code}
>  
> The same table is scanned twice because Spark doesn't know it should load b 
> when the first query runs. You can use materialize to load and then reuse the 
> data:
> {code:java}
> val materialized = spark.table("some_table").filter("c LIKE '%match%'")
> .select($"a", $"b").repartition($"a").materialize()
> val num_groups = materialized.agg(distinctCount($"a"))
> val groups_with_b = materialized.groupBy($"a").agg(min($"b") as "min"){code}
>  
> This uses select to filter out columns that don't need to be loaded. Without 
> this, Spark doesn't know that only a and b are going to be used later.
> This example also uses repartition to add a shuffle because Spark resumes 
> from the last shuffle. In most cases you may need to repartition the 
> dataframe before materializing it in order to skip the expensive stages as 
> repartition introduces a new stage. 
> h3. Materialize vs Cache:
>  * Caching/Persisting of dataframes is lazy. The first time the dataset is 
> computed in an action, it will be kept in memory on the nodes. Materialize is 
> an action that runs a job that produces the rows of data that a data frame 
> represents, and returns a new data frame with the result. When the result 
> data frame is used, Spark resumes execution using the data from the last 
> shuffle.
>  * By reusing shuffle data, materialized data is served by the cluster's 
> persistent shuffle servers instead of Spark executors. This makes materialize 
> more reliable. Caching on the other hand happens in the executor where the 
> task runs and data could be lost if executors time out from inactivity or run 
> out of memory.
>  * Since materialize is more reliable and uses fewer resources than cache, it 
> is usually a better choice for batch workloads. But, for processing that 
> iterates over a dataset many times, it is better to keep the data in memory 
> using cache or persist.






[jira] [Assigned] (SPARK-28188) Materialize Dataframe API

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28188:


Assignee: (was: Apache Spark)

> Materialize Dataframe API 
> --
>
> Key: SPARK-28188
> URL: https://issues.apache.org/jira/browse/SPARK-28188
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Vinitha Reddy Gankidi
>Priority: Major
>
> We have added a new API to materialize dataframes and our internal users have 
> found it very useful. For use cases where you need to do different 
> computations on the same dataframe, Spark recomputes the dataframe each time. 
> This is problematic if evaluation of the dataframe is expensive.
> Materialize is a Spark action. It is a way to let Spark explicitly know that 
> the dataframe has already been computed. Once a dataframe is materialized, 
> Spark skips all stages prior to the materialize when the dataframe is reused 
> later on.
> Spark may scan the same table twice if two queries load different columns. 
> For example, the following two queries would scan the same data twice:
> {code:java}
> val tab = spark.table("some_table").filter("c LIKE '%match%'")
> val num_groups = tab.agg(distinctCount($"a"))
> val groups_with_b = tab.groupBy($"a").agg(min($"b") as "min"){code}
>  
> The same table is scanned twice because Spark doesn't know it should load b 
> when the first query runs. You can use materialize to load and then reuse the 
> data:
> {code:java}
> val materialized = spark.table("some_table").filter("c LIKE '%match%'")
> .select($"a", $"b").repartition($"a").materialize()
> val num_groups = materialized.agg(distinctCount($"a"))
> val groups_with_b = materialized.groupBy($"a").agg(min($"b") as "min"){code}
>  
> This uses select to filter out columns that don't need to be loaded. Without 
> this, Spark doesn't know that only a and b are going to be used later.
> This example also uses repartition to add a shuffle because Spark resumes 
> from the last shuffle. In most cases you may need to repartition the 
> dataframe before materializing it in order to skip the expensive stages as 
> repartition introduces a new stage. 
> h3. Materialize vs Cache:
>  * Caching/Persisting of dataframes is lazy. The first time the dataset is 
> computed in an action, it will be kept in memory on the nodes. Materialize is 
> an action that runs a job that produces the rows of data that a data frame 
> represents, and returns a new data frame with the result. When the result 
> data frame is used, Spark resumes execution using the data from the last 
> shuffle.
>  * By reusing shuffle data, materialized data is served by the cluster's 
> persistent shuffle servers instead of Spark executors. This makes materialize 
> more reliable. Caching on the other hand happens in the executor where the 
> task runs and data could be lost if executors time out from inactivity or run 
> out of memory.
>  * Since materialize is more reliable and uses fewer resources than cache, it 
> is usually a better choice for batch workloads. But, for processing that 
> iterates over a dataset many times, it is better to keep the data in memory 
> using cache or persist.






[jira] [Assigned] (SPARK-28187) Add hadoop-cloud module to PR builders

2019-06-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28187:
--

Assignee: Marcelo Vanzin

> Add hadoop-cloud module to PR builders
> --
>
> Key: SPARK-28187
> URL: https://issues.apache.org/jira/browse/SPARK-28187
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> We currently don't build / test the hadoop-cloud stuff in PRs. See 
> https://github.com/apache/spark/pull/24970 for an example.






[jira] [Resolved] (SPARK-28187) Add hadoop-cloud module to PR builders

2019-06-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28187.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24987
[https://github.com/apache/spark/pull/24987]

> Add hadoop-cloud module to PR builders
> --
>
> Key: SPARK-28187
> URL: https://issues.apache.org/jira/browse/SPARK-28187
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> We currently don't build / test the hadoop-cloud stuff in PRs. See 
> https://github.com/apache/spark/pull/24970 for an example.






[jira] [Comment Edited] (SPARK-28192) Data Source - State - Write side

2019-06-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874538#comment-16874538
 ] 

Jungtaek Lim edited comment on SPARK-28192 at 6/27/19 10:07 PM:


I realized the new DSv2 (maybe the old DSv2 too?) requires the DataFrame to be 
partitioned correctly before it reaches the sink. The state writer is not that 
case, and unfortunately there's no storage coordinating this. It should 
repartition by key itself, which was possible with DSv1 (since it provides the 
DataFrame to write) but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]

[~rdblue] [~cloud_fan] Which would be the best way to address this? Would I need 
to wrap this with some method that handles repartitioning before adding to the sink?
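For reference, the DSv1-style shape of the repartition-before-write step is roughly the sketch below; the column names, partition count, and the "state" format name are assumptions taken from the linked spark-state-tools code, not Spark built-ins:
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hedged sketch: repartition state rows by their key columns so each output
// partition maps to one state store partition, then hand the already-partitioned
// DataFrame to the writer. With DSv1 the source receives the DataFrame itself,
// so this step can live inside the writer; with DSv2 that is the open question above.
def writeState(data: DataFrame, keyColumns: Seq[String], numStatePartitions: Int,
               checkpointLocation: String): Unit = {
  data
    .repartition(numStatePartitions, keyColumns.map(col): _*)
    .write
    .format("state") // assumption: custom data source short name
    .option("checkpointLocation", checkpointLocation)
    .save()
}
{code}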


was (Author: kabhwan):
I realized the new DSv2 (maybe the old DSv2 too?) requires the DataFrame to be 
partitioned correctly before it reaches the sink. The state writer is not that 
case, as there's no storage coordinating this. It should repartition by key 
itself, which was possible with DSv1 (since it provides the DataFrame to write) 
but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]

[~rdblue] [~cloud_fan] Which would be the best way to address this? Would I need 
to wrap this with some method that handles repartitioning before adding to the sink?

> Data Source - State - Write side
> 
>
> Key: SPARK-28192
> URL: https://issues.apache.org/jira/browse/SPARK-28192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to support batch writes to the state data source.
> It could include "state repartition" if it doesn't require a huge effort with 
> the new DSv2, but it can also be moved out to a separate issue.






[jira] [Created] (SPARK-28193) toPandas() not working as expected in Apache Spark 2.4.0

2019-06-27 Thread SUSHMIT ROY (JIRA)
SUSHMIT ROY created SPARK-28193:
---

 Summary: toPandas() not working as expected in Apache Spark 2.4.0
 Key: SPARK-28193
 URL: https://issues.apache.org/jira/browse/SPARK-28193
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.2
 Environment: Databricks 5.3 Apache Spark 2.4.0
Reporter: SUSHMIT ROY


I am in a Databricks environment using PySpark 2.4.0, but toPandas() is still 
taking a lot of time. Any ideas on what could be causing the issue? 
Also, I am getting a warning like {{UserWarning: pyarrow.open_stream is 
deprecated, please use pyarrow.ipc.open_stream}}, although I have upgraded to 
pyarrow 0.13.0.






[jira] [Commented] (SPARK-28192) Data Source - State - Write side

2019-06-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874538#comment-16874538
 ] 

Jungtaek Lim commented on SPARK-28192:
--

I realized the new DSv2 (maybe the old DSv2 too?) requires the DataFrame to be 
partitioned correctly before it reaches the sink. The state writer is not that 
case, as there's no storage coordinating this. It should repartition by key 
itself, which was possible with DSv1 (since it provides the DataFrame to write) 
but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]

[~rdblue] [~cloud_fan] Which would be the best way to address this? Would I need 
to wrap this with some method that handles repartitioning before adding to the sink?

> Data Source - State - Write side
> 
>
> Key: SPARK-28192
> URL: https://issues.apache.org/jira/browse/SPARK-28192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to support batch writes to the state data source.
> It could include "state repartition" if it doesn't require a huge effort with 
> the new DSv2, but it can also be moved out to a separate issue.






[jira] [Assigned] (SPARK-28191) Data Source - State - Read side

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28191:


Assignee: (was: Apache Spark)

> Data Source - State - Read side
> ---
>
> Key: SPARK-28191
> URL: https://issues.apache.org/jira/browse/SPARK-28191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to support batch reads from the state data source.






[jira] [Assigned] (SPARK-28191) Data Source - State - Read side

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28191:


Assignee: Apache Spark

> Data Source - State - Read side
> ---
>
> Key: SPARK-28191
> URL: https://issues.apache.org/jira/browse/SPARK-28191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> This issue tracks the effort to support batch reads from the state data source.






[jira] [Assigned] (SPARK-27959) Change YARN resource configs to use .amount

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27959:


Assignee: (was: Apache Spark)

> Change YARN resource configs to use .amount
> ---
>
> Key: SPARK-27959
> URL: https://issues.apache.org/jira/browse/SPARK-27959
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We are adding generic resource support into Spark, where we have a suffix for 
> the amount of the resource so that we could support other configs. 
> Spark on YARN already added configs to request resources via 
> spark.yarn.\{executor/driver/am}.resource=<resource amount>, where the 
> <resource amount> is the value and unit together. We should change those 
> configs to have a .amount suffix on them to match the Spark configs and to 
> allow future configs to be more easily added. YARN itself already supports 
> tags and attributes, so if we want the user to be able to pass those from 
> Spark at some point, having a suffix makes sense.
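From a user's point of view the rename could look like the sketch below; the resource name {{gpu}} and the exact key layout are assumptions, only the {{.amount}} suffix is what this issue proposes:
{code:scala}
import org.apache.spark.SparkConf

// Hedged sketch: requesting a custom YARN resource with the proposed ".amount"
// suffix, mirroring the spark.{executor,driver}.resource.<name>.amount style.
val conf = new SparkConf()
  .set("spark.yarn.executor.resource.gpu.amount", "2") // previously: ...resource.gpu=<amount>
  .set("spark.yarn.driver.resource.gpu.amount", "1")
  .set("spark.yarn.am.resource.gpu.amount", "1")
{code}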






[jira] [Assigned] (SPARK-27959) Change YARN resource configs to use .amount

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27959:


Assignee: Apache Spark

> Change YARN resource configs to use .amount
> ---
>
> Key: SPARK-27959
> URL: https://issues.apache.org/jira/browse/SPARK-27959
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> We are adding generic resource support into Spark, where we have a suffix for 
> the amount of the resource so that we could support other configs. 
> Spark on YARN already added configs to request resources via 
> spark.yarn.\{executor/driver/am}.resource=<resource amount>, where the 
> <resource amount> is the value and unit together. We should change those 
> configs to have a .amount suffix on them to match the Spark configs and to 
> allow future configs to be more easily added. YARN itself already supports 
> tags and attributes, so if we want the user to be able to pass those from 
> Spark at some point, having a suffix makes sense.






[jira] [Created] (SPARK-28192) Data Source - State - Write side

2019-06-27 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-28192:


 Summary: Data Source - State - Write side
 Key: SPARK-28192
 URL: https://issues.apache.org/jira/browse/SPARK-28192
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the effort to support batch writes to the state data source.

It could include "state repartition" if it doesn't require a huge effort with the 
new DSv2, but it can also be moved out to a separate issue.






[jira] [Created] (SPARK-28191) Data Source - State - Read side

2019-06-27 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-28191:


 Summary: Data Source - State - Read side
 Key: SPARK-28191
 URL: https://issues.apache.org/jira/browse/SPARK-28191
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the effort to support batch reads on the state data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28190) Data Source - State

2019-06-27 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874522#comment-16874522
 ] 

Jungtaek Lim commented on SPARK-28190:
--

While I'll create a couple of sub-issues soon, please also let me know if we 
would like to apply the SPIP process for this. Thanks for your interest in this!

> Data Source - State
> ---
>
> Key: SPARK-28190
> URL: https://issues.apache.org/jira/browse/SPARK-28190
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> "State" is becoming one of most important data on most of streaming 
> frameworks, which makes us getting continuous result of the query. In other 
> words, query could be no longer valid once state is corrupted or lost.
> Ideally we could run the query from the first of data to construct a 
> brand-new state for current query, but in reality it may not be possible for 
> many reasons, like input data source having retention, lots of resource waste 
> to rerun from start, etc.
>  
> There're other cases which end users want to deal with state, like creating 
> initial state from existing data via batch query (given batch query could be 
> far more efficient and faster).
> I'd like to propose a new data source which handles "state" in batch query, 
> enabling read and write on state.
> Allowing state read brings couple of benefits:
>  * You can analyze the state from "outside" of your streaming query
>  * It could be useful when there's something which can be derived from 
> existing state of existing query - note that state is not designed to be 
> shared among multiple queries
> Allowing state (re)write brings couple of major benefits:
>  * State can be repartitioned physically
>  * Schema in state can be changed, which means you don't need to run the 
> query from the start when the query should be changed
>  * You can remove state rows if you want, like reducing size, removing 
> corrupt, etc.
>  * You can bootstrap state in your new query with existing data efficiently, 
> don't need to run streaming query from the start point
> Btw, basically I'm planning to contribute my own works 
> ([https://github.com/HeartSaVioR/spark-state-tools]), so for many of 
> sub-issues it would require not-too-much amount of efforts to submit patches. 
> I'll try to apply new DSv2, so it could be a major effort while preparing to 
> donate code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28190) Data Source - State

2019-06-27 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-28190:


 Summary: Data Source - State
 Key: SPARK-28190
 URL: https://issues.apache.org/jira/browse/SPARK-28190
 Project: Spark
  Issue Type: Umbrella
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


"State" is becoming one of most important data on most of streaming frameworks, 
which makes us getting continuous result of the query. In other words, query 
could be no longer valid once state is corrupted or lost.

Ideally we could run the query from the first of data to construct a brand-new 
state for current query, but in reality it may not be possible for many 
reasons, like input data source having retention, lots of resource waste to 
rerun from start, etc.

 

There're other cases which end users want to deal with state, like creating 
initial state from existing data via batch query (given batch query could be 
far more efficient and faster).

I'd like to propose a new data source which handles "state" in batch query, 
enabling read and write on state.

Allowing state read brings couple of benefits:
 * You can analyze the state from "outside" of your streaming query
 * It could be useful when there's something which can be derived from existing 
state of existing query - note that state is not designed to be shared among 
multiple queries

Allowing state (re)write brings couple of major benefits:
 * State can be repartitioned physically
 * Schema in state can be changed, which means you don't need to run the query 
from the start when the query should be changed
 * You can remove state rows if you want, like reducing size, removing corrupt, 
etc.
 * You can bootstrap state in your new query with existing data efficiently, 
don't need to run streaming query from the start point

Btw, basically I'm planning to contribute my own works 
([https://github.com/HeartSaVioR/spark-state-tools]), so for many of sub-issues 
it would require not-too-much amount of efforts to submit patches. I'll try to 
apply new DSv2, so it could be a major effort while preparing to donate code.
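A purely hypothetical usage sketch of what such a batch state source could look 
like; the format name "state" and every option, path, and column name below are 
assumptions for illustration, not an existing Spark or spark-state-tools API:

{code:python}
# Hypothetical sketch only: the "state" format name, its options, and the paths
# are assumptions, not an existing API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("state-source-sketch").getOrCreate()

# read the state of a stateful operator out of an existing query checkpoint
state_df = (
    spark.read.format("state")                               # assumed format
    .option("checkpointLocation", "/tmp/query-checkpoint")   # placeholder path
    .option("operatorId", "0")                               # assumed option
    .load()
)
state_df.printSchema()

# write a repartitioned copy back for a new query to bootstrap from
(
    state_df.repartition(32, "key")                          # assumes a "key" column
    .write.format("state")
    .option("checkpointLocation", "/tmp/new-query-checkpoint")
    .save()
)
{code}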



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-06-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26985:

Labels: BigEndian correctness  (was: BigEndian)

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Assignee: ketan kunde
>Priority: Major
>  Labels: BigEndian, correctness
> Fix For: 3.0.0
>
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures in 2 suites of the SQL project:
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both cases the test "access only some column of the all of columns" fails 
> due to a mismatch in the final assert.
> I observed that the data obtained after df.cache() is causing the error. 
> Please find the log with the details attached.
> cache() works perfectly fine if double and float values are not in the 
> picture.
> Inside the test: access only some column of the all of columns *** FAILED ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke updated SPARK-28189:
-
Description: 
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop('caps') # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 

  was:
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 


> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop('caps') # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  
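A workaround sketch (not a fix), assuming the df2 and df_joined frames from the 
report: dropping by column name stays case insensitive, so resolve the Column 
reference to a name first. Note that a name-based drop removes every column 
with that name.

{code:python}
# Workaround sketch, assuming df2 and df_joined as defined in the report.
# Resolve the Column reference to its actual name, then drop by name.
col_name = next(c for c in df2.columns if c.lower() == 'caps')  # -> 'CAPS'
df_joined.drop(col_name).show()

# or simply pass the Column reference with the matching case
df_joined.drop(df2['CAPS']).show()
{code}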



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke updated SPARK-28189:
-
Description: 
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 

  was:
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 


> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop(caps) # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke updated SPARK-28189:
-
Description: 
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 

  was:
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 


> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop(caps) # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke updated SPARK-28189:
-
Description: 
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(valuesA, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(valuesB, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 

  was:
Column names in general are case insensitive in Pyspark, and df.drop("col") in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(valuesA, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(valuesB, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 


> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(valuesA, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(valuesB, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop(caps) # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke updated SPARK-28189:
-
Summary: Pyspark  - df.drop() is Case Sensitive when Referring to Upstream 
Tables  (was: Pyspark  - df.drop is Case Sensitive when Referring to Upstream 
Tables)

> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop("col") 
> in general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(valuesA, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(valuesB, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop(caps) # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke updated SPARK-28189:
-
Description: 
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 

  was:
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(valuesA, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(valuesB, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 


> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop(caps) # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28189) Pyspark - df.drop is Case Sensitive when Referring to Upstream Tables

2019-06-27 Thread Luke (JIRA)
Luke created SPARK-28189:


 Summary: Pyspark  - df.drop is Case Sensitive when Referring to 
Upstream Tables
 Key: SPARK-28189
 URL: https://issues.apache.org/jira/browse/SPARK-28189
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Luke


Column names in general are case insensitive in Pyspark, and df.drop("col") in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(valuesA, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(valuesB, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] = df2['key'], "left")
{code}
 

drop will become case sensitive. e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']

# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop(caps) # will also give a result
{code}
however, note the following
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

so in summary, using df.drop(df2['col']) doesn't align with expected case 
insensitivity for column names, even though functions like select, join, and 
dropping a column generally are case insensitive.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth

2019-06-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28150.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24955
[https://github.com/apache/spark/pull/24955]

> Failure to create multiple contexts in same JVM with Kerberos auth
> --
>
> Key: SPARK-28150
> URL: https://issues.apache.org/jira/browse/SPARK-28150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> Take the following small app that creates multiple contexts (not 
> concurrently):
> {code}
> from pyspark.context import SparkContext
> import time
> for i in range(2):
>   with SparkContext() as sc:
>     pass
>   time.sleep(5)
> {code}
> This fails when kerberos (without dt renewal) is being used:
> {noformat}
> 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
> Caused by: 
> org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error 
> calling method hbase.pb.AuthenticationService.GetAuthenticationToken
> at 
> org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
>  org.apache.hadoop.hbase.security.AccessDeniedException: Token generation 
> only allowed for Kerberos authenticated clients
> at 
> org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126)
> {noformat}
> If you enable dt renewal, things work, since the code takes a slightly 
> different path when generating the initial delegation tokens.
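A sketch of that workaround: delegation token renewal is enabled by running 
with a keytab and principal (e.g. spark-submit --keytab / --principal). The 
config key names below assume the Spark 3.0 naming, and the keytab path and 
principal are placeholders:

{code:python}
# Sketch only: enables dt renewal by supplying a keytab and principal.
# Config key names assume Spark 3.0; the path and principal are placeholders.
import time
from pyspark import SparkConf
from pyspark.context import SparkContext

conf = (
    SparkConf()
    .set("spark.kerberos.keytab", "/path/to/user.keytab")
    .set("spark.kerberos.principal", "user@EXAMPLE.COM")
)

for i in range(2):
    with SparkContext(conf=conf) as sc:
        pass
    time.sleep(5)
{code}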



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth

2019-06-27 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28150:
--

Assignee: Marcelo Vanzin

> Failure to create multiple contexts in same JVM with Kerberos auth
> --
>
> Key: SPARK-28150
> URL: https://issues.apache.org/jira/browse/SPARK-28150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> Take the following small app that creates multiple contexts (not 
> concurrently):
> {code}
> from pyspark.context import SparkContext
> import time
> for i in range(2):
>   with SparkContext() as sc:
>     pass
>   time.sleep(5)
> {code}
> This fails when kerberos (without dt renewal) is being used:
> {noformat}
> 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49)
> Caused by: 
> org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error 
> calling method hbase.pb.AuthenticationService.GetAuthenticationToken
> at 
> org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
>  org.apache.hadoop.hbase.security.AccessDeniedException: Token generation 
> only allowed for Kerberos authenticated clients
> at 
> org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126)
> {noformat}
> If you enable dt renewal, things work, since the code takes a slightly 
> different path when generating the initial delegation tokens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27871) LambdaVariable should use per-query unique IDs instead of globally unique IDs

2019-06-27 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27871.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> LambdaVariable should use per-query unique IDs instead of globally unique IDs
> -
>
> Key: SPARK-27871
> URL: https://issues.apache.org/jira/browse/SPARK-27871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28188) Materialize Dataframe API

2019-06-27 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created SPARK-28188:
-

 Summary: Materialize Dataframe API 
 Key: SPARK-28188
 URL: https://issues.apache.org/jira/browse/SPARK-28188
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Vinitha Reddy Gankidi


We have added a new API to materialize dataframes and our internal users have 
found it very useful. For use cases where you need to do different computations 
on the same dataframe, Spark recomputes the dataframe each time. This is 
problematic if evaluation of the dataframe is expensive.

Materialize is a Spark action. It is a way to let Spark explicitly know that 
the dataframe has already been computed. Once a dataframe is materialized, 
Spark skips all stages prior to the materialize when the dataframe is reused 
later on.

Spark may scan the same table twice if two queries load different columns. For 
example, the following two queries would scan the same data twice:
{code:java}
val tab = spark.table("some_table").filter("c LIKE '%match%'")

val num_groups = tab.agg(countDistinct($"a"))

val groups_with_b = tab.groupBy($"a").agg(min($"b") as "min"){code}
 

The same table is scanned twice because Spark doesn't know it should load b 
when the first query runs. You can use materialize to load and then reuse the 
data:

{code:java}
val materialized = spark.table("some_table").filter("c LIKE '%match%'")

.select($"a", $"b").repartition($"a").materialize()

val num_groups = materialized.agg(countDistinct($"a"))

val groups_with_b = materialized.groupBy($"a").agg(min($"b") as "min"){code}
 

This uses select to filter out columns that don't need to be loaded. Without 
this, Spark doesn't know that only a and b are going to be used later.

This example also uses repartition to add a shuffle because Spark resumes from 
the last shuffle. In most cases you may need to repartition the dataframe 
before materializing it in order to skip the expensive stages as repartition 
introduces a new stage. 
h3. Materialize vs Cache:
 * Caching/Persisting of dataframes is lazy. The first time the dataset is 
computed in an action, it will be kept in memory on the nodes. Materialize is 
an action that runs a job that produces the rows of data that a data frame 
represents, and returns a new data frame with the result. When the result data 
frame is used, Spark resumes execution using the data from the last shuffle.
 * By reusing shuffle data, materialized data is served by the cluster's 
persistent shuffle servers instead of Spark executors. This makes materialize 
more reliable. Caching on the other hand happens in the executor where the task 
runs and data could be lost if executors time out from inactivity or run out of 
memory.
 * Since materialize is more reliable and uses fewer resources than cache, it 
is usually a better choice for batch workloads. But, for processing that 
iterates over a dataset many times, it is better to keep the data in memory 
using cache or persist.
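For comparison, a rough present-day approximation of the example above using an 
eager cache; this assumes materialize() is only the proposed API and does not 
exist in stock Spark, and reuses the hypothetical "some_table" and an active 
SparkSession named spark from the example:

{code:python}
# Rough approximation of the reuse above with cache(); materialize() itself is
# only a proposal and is not used here. Assumes the hypothetical "some_table"
# and an active SparkSession named `spark`.
from pyspark.sql import functions as F

tab = (
    spark.table("some_table")
    .filter("c LIKE '%match%'")
    .select("a", "b")          # keep only the columns needed later
    .repartition("a")
    .cache()
)
tab.count()  # action to populate the cache eagerly

num_groups = tab.agg(F.countDistinct("a"))
groups_with_b = tab.groupBy("a").agg(F.min("b").alias("min"))
{code}

As the description notes, this keeps the data in executor memory rather than in 
shuffle files, which is exactly the reliability trade-off materialize is meant 
to avoid.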



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries

2019-06-27 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-28157:

Fix Version/s: 2.4.4
   2.3.4

> Make SHS clear KVStore LogInfo for the blacklisted entries
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> In Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system and maintains a blacklist of all event log files that failed 
> once at reading. The blacklisted log files are released again after 
> CLEAN_INTERVAL_S.
> However, files whose size doesn't change are ignored forever, because 
> shouldReloadLog always returns false when the size is the same as the value 
> in KVStore. This can only be recovered via an SHS restart.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28187) Add hadoop-cloud module to PR builders

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28187:


Assignee: Apache Spark

> Add hadoop-cloud module to PR builders
> --
>
> Key: SPARK-28187
> URL: https://issues.apache.org/jira/browse/SPARK-28187
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> We currently don't build / test the hadoop-cloud stuff in PRs. See 
> https://github.com/apache/spark/pull/24970 for an example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28187) Add hadoop-cloud module to PR builders

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28187:


Assignee: (was: Apache Spark)

> Add hadoop-cloud module to PR builders
> --
>
> Key: SPARK-28187
> URL: https://issues.apache.org/jira/browse/SPARK-28187
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We currently don't build / test the hadoop-cloud stuff in PRs. See 
> https://github.com/apache/spark/pull/24970 for an example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28187) Add hadoop-cloud module to PR builders

2019-06-27 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-28187:
--

 Summary: Add hadoop-cloud module to PR builders
 Key: SPARK-28187
 URL: https://issues.apache.org/jira/browse/SPARK-28187
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


We currently don't build / test the hadoop-cloud stuff in PRs. See 
https://github.com/apache/spark/pull/24970 for an example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27466) LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' causes exception in Spark

2019-06-27 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874292#comment-16874292
 ] 

Bruce Robbins commented on SPARK-27466:
---

Hi [~hvanhovell] and/or [~yhuai], any comment on my previous comment?

> LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' 
> causes exception in Spark
> ---
>
> Key: SPARK-27466
> URL: https://issues.apache.org/jira/browse/SPARK-27466
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: Spark version 2.2.0.2.6.4.92-2
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
>Reporter: Zoltan
>Priority: Major
>
> *1. Create a table in Hive:*
>   
> {code:java}
>  CREATE TABLE tab1(
>    col1 varchar(1),
>    col2 varchar(1)
>   )
>  PARTITIONED BY (
>    col3 varchar(1)
>  )
>  LOCATION
>    'hdfs://server1/data/tab1'
> {code}
>  
>  *2. Query the Table in Spark:*
> *2.1: Simple query, no exception thrown:*
> {code:java}
> scala> spark.sql("SELECT * from schema1.tab1").show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> +----+----+----+
> {code}
> *2.2: Query causing exception:*
> {code:java}
> scala> spark.sql("SELECT (LEAD(col1) OVER (PARTITION BY col3 ORDER BY col1 
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) from 
> schema1.tab1")
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING must match the required frame ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING;
>    at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>    at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>    at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2219)
>    at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2215)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>    at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
>    at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>    at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>    at scala.collection.immutable.List.foreach(List.scala:381)
>    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>    at scala.collection.immutable.List.map(List.scala:285)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
>    at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
>    at 
> 
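A query-side workaround sketch for the error above, assuming the schema1.tab1 
table from the report and an active SparkSession named spark: LEAD and LAG 
define their own frame, so the explicit ROWS BETWEEN clause can simply be 
dropped and Spark will apply the frame it requires.

{code:python}
# Workaround sketch: omit the explicit frame so the analyzer uses the frame
# required by lead(). Assumes schema1.tab1 from the report.
spark.sql("""
    SELECT lead(col1) OVER (PARTITION BY col3 ORDER BY col1) AS next_col1
    FROM schema1.tab1
""").show()
{code}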

[jira] [Resolved] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28174.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24976

> Upgrade to Kafka 2.3.0
> --
>
> Key: SPARK-28174
> URL: https://issues.apache.org/jira/browse/SPARK-28174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue updates the Kafka dependency to 2.3.0 to bring in at least the 
> following 9 client-side patches.
> - 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20fixVersion%20NOT%20IN%20(2.2.0%2C%202.2.1)%20AND%20component%20%3D%20clients
> The following is a full release note.
> - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-06-27 Thread Alex Kushnir (JIRA)
Alex Kushnir created SPARK-28186:


 Summary: array_contains returns null instead of false when one of 
the items in the array is null
 Key: SPARK-28186
 URL: https://issues.apache.org/jira/browse/SPARK-28186
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Alex Kushnir


If an array contains a null item, array_contains returns true when the item is 
found, but when the item is not found it returns null instead of false.

Seq(

(1, Seq("a", "b", "c")),

(2, Seq("a", "b", null, "c"))

).toDF("id", "vals").createOrReplaceTempView("tbl")


spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
array_contains(vals, 'd') as has_d from tbl").show
+---+----------+-----+-----+
| id|      vals|has_a|has_d|
+---+----------+-----+-----+
|  1| [a, b, c]| true|false|
|  2|[a, b,, c]| true| null|
+---+----------+-----+-----+
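A query-side workaround sketch (not a fix for the reported behaviour), assuming 
the tbl temp view above and an active SparkSession named spark: coalesce the 
NULL produced by SQL's three-valued logic to false, so has_d is strictly 
boolean.

{code:python}
# Workaround sketch: coalesce the NULL result to false. Assumes the "tbl"
# temp view created above.
spark.sql("""
    SELECT id,
           vals,
           array_contains(vals, 'a')                  AS has_a,
           coalesce(array_contains(vals, 'd'), false) AS has_d
    FROM tbl
""").show()
{code}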



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-06-27 Thread Alex Kushnir (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Kushnir updated SPARK-28186:
-
Description: 
If an array contains a null item, array_contains returns true when the item is 
found, but when the item is not found it returns null instead of false.

Seq(

(1, Seq("a", "b", "c")),

(2, Seq("a", "b", null, "c"))

).toDF("id", "vals").createOrReplaceTempView("tbl")

spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
array_contains(vals, 'd') as has_d from tbl").show
+---+----------+-----+-----+
| id|      vals|has_a|has_d|
+---+----------+-----+-----+
|  1| [a, b, c]| true|false|
|  2|[a, b,, c]| true| null|
+---+----------+-----+-----+

  was:
If array of items contains a null item when array_contains returns true if item 
is found but if item is not found it returns null instead of false

Seq(

(1, Seq("a", "b", "c")),

(2, Seq("a", "b", null, "c"))

).toDF("id", "vals").createOrReplaceTempView("tbl")


spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
array_contains(vals, 'd') as has_d from tbl").show
+---+--+-+-+
| id| vals |has_a|has_d|
+---+--+-+-+
| 1| [a, b, c]| true|false|
| 2|[a, b,, c]| true| null|
+---+--+-+-+


> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If an array contains a null item, array_contains returns true when the item 
> is found, but when the item is not found it returns null instead of false.
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> | id|      vals|has_a|has_d|
> +---+----------+-----+-----+
> |  1| [a, b, c]| true|false|
> |  2|[a, b,, c]| true| null|
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28185:


Assignee: Apache Spark

> Trigger pandas iterator UDF closing stuff when iterator stop early
> --
>
> Key: SPARK-28185
> URL: https://issues.apache.org/jira/browse/SPARK-28185
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> Fix the issue where a pandas iterator UDF's cleanup logic is not triggered when 
> the iterator stops early.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28185:


Assignee: (was: Apache Spark)

> Trigger pandas iterator UDF closing stuff when iterator stop early
> --
>
> Key: SPARK-28185
> URL: https://issues.apache.org/jira/browse/SPARK-28185
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Priority: Major
>
> Fix the issue where a pandas iterator UDF's cleanup logic is not triggered when 
> the iterator stops early.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early

2019-06-27 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-28185:
--

 Summary: Trigger pandas iterator UDF closing stuff when iterator 
stop early
 Key: SPARK-28185
 URL: https://issues.apache.org/jira/browse/SPARK-28185
 Project: Spark
  Issue Type: Bug
  Components: ML, SQL
Affects Versions: 2.4.3
Reporter: Weichen Xu


Fix the issue where a pandas iterator UDF's cleanup logic is not triggered when the 
iterator stops early.
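
On the JVM side, the usual way to guarantee such cleanup is to register a completion 
listener on the TaskContext; the sketch below is an analogous illustration of that 
general pattern (my own example, not the actual pandas-side fix).

{code:scala}
// Sketch: clean up a per-partition resource even if the consumer stops
// iterating early; the pandas iterator UDF fix pursues the same goal for
// Python workers.
import org.apache.spark.TaskContext

val doubled = spark.sparkContext.parallelize(1 to 100, 4).mapPartitions { iter =>
  // Stand-in for a resource that must be released (file handle, socket, ...).
  val resource = new java.io.ByteArrayInputStream(Array[Byte](1, 2, 3))
  // Runs when the task completes, even if `iter` was never fully consumed.
  TaskContext.get().addTaskCompletionListener[Unit](_ => resource.close())
  iter.map(_ * 2)
}
doubled.take(3)  // stops early, yet the listener still closes the resource
{code}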

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28184:


Assignee: Apache Spark

> Avoid creating new sessions in SparkMetadataOperationSuite
> --
>
> Key: SPARK-28184
> URL: https://issues.apache.org/jira/browse/SPARK-28184
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28184:


Assignee: (was: Apache Spark)

> Avoid creating new sessions in SparkMetadataOperationSuite
> --
>
> Key: SPARK-28184
> URL: https://issues.apache.org/jira/browse/SPARK-28184
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28183) Add a task status filter for taskList in REST API

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28183:


Assignee: Apache Spark

> Add a task status filter for taskList in REST API
> -
>
> Key: SPARK-28183
> URL: https://issues.apache.org/jira/browse/SPARK-28183
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> We have a scenario where our application needs to query failed tasks via the REST 
> API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} 
> while a Spark job is running. In a large stage, it may need to filter out dozens 
> of failed tasks from hundreds of thousands of total tasks, which consumes a lot 
> of unnecessary memory and time on both the Spark and application sides.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28183) Add a task status filter for taskList in REST API

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28183:


Assignee: (was: Apache Spark)

> Add a task status filter for taskList in REST API
> -
>
> Key: SPARK-28183
> URL: https://issues.apache.org/jira/browse/SPARK-28183
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> We have a scenario where our application needs to query failed tasks via the REST 
> API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} 
> while a Spark job is running. In a large stage, it may need to filter out dozens 
> of failed tasks from hundreds of thousands of total tasks, which consumes a lot 
> of unnecessary memory and time on both the Spark and application sides.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite

2019-06-27 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28184:

Summary: Avoid creating new sessions in SparkMetadataOperationSuite  (was: 
Avoid create new session in SparkMetadataOperationSuite)

> Avoid creating new sessions in SparkMetadataOperationSuite
> --
>
> Key: SPARK-28184
> URL: https://issues.apache.org/jira/browse/SPARK-28184
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28184) Avoid create new session in SparkMetadataOperationSuite

2019-06-27 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28184:
---

 Summary: Avoid create new session in SparkMetadataOperationSuite
 Key: SPARK-28184
 URL: https://issues.apache.org/jira/browse/SPARK-28184
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28183) Add a task status filter for taskList in REST API

2019-06-27 Thread Lantao Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-28183:
---
Description: 
We have a scenario where our application needs to query failed tasks via the REST API 
{{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} while a 
Spark job is running. In a large stage, it may need to filter out dozens of failed 
tasks from hundreds of thousands of total tasks, which consumes a lot of unnecessary 
memory and time on both the Spark and application sides.




  was:
We have a scenario where our application needs to query failed tasks via the REST API 
{{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} while a 
Spark job is running. A large stage may contain hundreds of thousands of tasks in 
total. Although the API offers pagination via {{?offset=[offset]&length=[len]}}, it 
still has two disadvantages:
1. The application still has to query all the tasks, which consumes a lot of 
unnecessary memory and time on both the Spark and application sides.
2. Paginating via {{?offset=[offset]&length=[len]}} makes the logic much more 
complex, and it still has to handle all the tasks.





> Add a task status filter for taskList in REST API
> -
>
> Key: SPARK-28183
> URL: https://issues.apache.org/jira/browse/SPARK-28183
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> We have a scenario where our application needs to query failed tasks via the REST 
> API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} 
> while a Spark job is running. In a large stage, it may need to filter out dozens 
> of failed tasks from hundreds of thousands of total tasks, which consumes a lot 
> of unnecessary memory and time on both the Spark and application sides.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28183) Add a task status filter for taskList in REST API

2019-06-27 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-28183:
--

 Summary: Add a task status filter for taskList in REST API
 Key: SPARK-28183
 URL: https://issues.apache.org/jira/browse/SPARK-28183
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.0.0
Reporter: Lantao Jin


We have a scenario where our application needs to query failed tasks via the REST API 
{{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} while a 
Spark job is running. A large stage may contain hundreds of thousands of tasks in 
total. Although the API offers pagination via {{?offset=[offset]&length=[len]}}, it 
still has two disadvantages:
1. The application still has to query all the tasks, which consumes a lot of 
unnecessary memory and time on both the Spark and application sides.
2. Paginating via {{?offset=[offset]&length=[len]}} makes the logic much more 
complex, and it still has to handle all the tasks.
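
A hypothetical usage sketch of the proposed filter; the {{status}} query parameter 
name is an assumption for illustration, not a committed API.

{code:scala}
// Fetch only the FAILED tasks of one stage attempt through the proposed
// filter instead of paging through every task; ids and URL are placeholders.
import scala.io.Source

val appId = "app-20190627120000-0001"
val url = s"http://localhost:4040/api/v1/applications/$appId" +
  "/stages/3/0/taskList?status=FAILED"
val failedTasksJson = Source.fromURL(url).mkString
println(failedTasksJson)
{code}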






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27714:


Assignee: (was: Apache Spark)

> Support Join Reorder based on Genetic Algorithm when the # of joined tables > 
> 12
> 
>
> Key: SPARK-27714
> URL: https://issues.apache.org/jira/browse/SPARK-27714
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Major
>
> Now the join reorder logic is based on dynamic programming, which can in theory 
> find the most optimized plan, but the search cost grows rapidly as the number of 
> joined tables grows. It would be better to introduce a genetic algorithm (GA) to 
> overcome this problem.
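
For reference, a sketch of the knobs around the existing DP-based reorder (these are 
existing Spark SQL configs; how a GA strategy would hook into them is exactly what 
this ticket leaves open).

{code:scala}
// CBO and join reorder are off by default and must be enabled explicitly; the
// DP search is only used while the number of joined relations stays at or
// below the threshold (default 12), which is where a GA could take over.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "12")
{code}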



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27714:


Assignee: Apache Spark

> Support Join Reorder based on Genetic Algorithm when the # of joined tables > 
> 12
> 
>
> Key: SPARK-27714
> URL: https://issues.apache.org/jira/browse/SPARK-27714
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Assignee: Apache Spark
>Priority: Major
>
> Now the join reorder logic is based on dynamic programming, which can in theory 
> find the most optimized plan, but the search cost grows rapidly as the number of 
> joined tables grows. It would be better to introduce a genetic algorithm (GA) to 
> overcome this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28182) Spark fails to download Hive 2.2+ jars from maven

2019-06-27 Thread Emlyn Corrin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874029#comment-16874029
 ] 

Emlyn Corrin commented on SPARK-28182:
--

I can work around it by adding --packages org.apache.zookeeper:zookeeper:3.4.6 
to the command line, or by setting an earlier Hive metastore version (it seems 
to work even with the default 1.2.1 jars when connecting to Hive metastore 
2.3.0).
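
A sketch of the second workaround expressed as session configuration (the config 
keys are existing Spark SQL options; the metastore URI is a placeholder).

{code:scala}
// Pin the metastore client to the built-in 1.2.1 jars while still talking to
// a Hive 2.3.0 metastore, as described in the comment above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-jars-workaround")
  .config("spark.sql.catalogImplementation", "hive")
  .config("spark.hadoop.hive.metastore.uris", "thrift://hive-metastore:9083") // placeholder
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "builtin")
  .getOrCreate()
{code}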

> Spark fails to download Hive 2.2+ jars from maven
> -
>
> Key: SPARK-28182
> URL: https://issues.apache.org/jira/browse/SPARK-28182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Emlyn Corrin
>Priority: Major
>
> {{When starting Spark with spark.sql.hive.metastore.jars=maven and a 
> spark.sql.hive.metastore.version of 2.2 or 2.3 it fails to download the 
> required jars. It looks like it just downloads the -tests version of the 
> zookeeper 3.4.6 jar:}}
> {noformat}
> > rm -rf ~/.ivy2
> > pyspark --conf 
> > spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 --conf 
> > spark.sql.catalogImplementation=hive --conf 
> > spark.sql.hive.metastore.jars=maven --conf 
> > spark.sql.hive.metastore.version=2.3
> Python 3.7.3 (default, Mar 27 2019, 09:23:39)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
> 19/06/27 12:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
>       /_/
> Using Python version 3.7.3 (default, Mar 27 2019 09:23:39)
> SparkSession available as 'spark'.
> In [1]: spark.sql('show databases').show(){noformat}
>  
> {noformat}
> http://www.datanucleus.org/downloads/maven2 added as a remote repository with 
> the name: repo-1
> Ivy Default Cache set to: /Users/emcorrin/.ivy2/cache
> The jars for the packages stored in: /Users/emcorrin/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/Cellar/apache-spark/2.4.3/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hive#hive-metastore added as a dependency
> org.apache.hive#hive-exec added as a dependency
> org.apache.hive#hive-common added as a dependency
> org.apache.hive#hive-serde added as a dependency
> com.google.guava#guava added as a dependency
> org.apache.hadoop#hadoop-client added as a dependency
> :: resolving dependencies :: 
> org.apache.spark#spark-submit-parent-639456d4-9614-45c9-ad3c-567e4fa69f79;1.0
>  confs: [default]
>  found org.apache.hive#hive-metastore;2.3.3 in central
>  found org.apache.hive#hive-serde;2.3.3 in central
>  found org.apache.hive#hive-common;2.3.3 in central
>  found org.apache.hive#hive-shims;2.3.3 in central
>  found org.apache.hive.shims#hive-shims-common;2.3.3 in central
>  found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central
>  found org.slf4j#slf4j-api;1.7.10 in central
>  found com.google.guava#guava;14.0.1 in central
>  found commons-lang#commons-lang;2.6 in central
>  found org.apache.thrift#libthrift;0.9.3 in central
>  found org.apache.httpcomponents#httpclient;4.4 in central
>  found org.apache.httpcomponents#httpcore;4.4 in central
>  found commons-logging#commons-logging;1.2 in central
>  found commons-codec#commons-codec;1.4 in central
>  found org.apache.zookeeper#zookeeper;3.4.6 in central
>  found org.slf4j#slf4j-log4j12;1.6.1 in central
>  found log4j#log4j;1.2.16 in central
>  found jline#jline;2.12 in central
>  found io.netty#netty;3.7.0.Final in central
>  found org.apache.hive.shims#hive-shims-0.23;2.3.3 in central
>  found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in central
>  found org.apache.hadoop#hadoop-annotations;2.7.2 in central
>  found com.google.inject.extensions#guice-servlet;3.0 in central
>  found com.google.inject#guice;3.0 in central
>  found javax.inject#javax.inject;1 in central
>  found aopalliance#aopalliance;1.0 in central
>  found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central
>  found asm#asm;3.2 in central
>  found com.google.protobuf#protobuf-java;2.5.0 in central
>  found commons-io#commons-io;2.4 in central
>  found com.sun.jersey#jersey-json;1.14 in central
>  found org.codehaus.jettison#jettison;1.1 in central
>  found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central
>  found javax.xml.bind#jaxb-api;2.2.2 in central
>  found javax.xml.stream#stax-api;1.0-2 in central
>  found javax.activation#activation;1.1 in central

[jira] [Created] (SPARK-28182) Spark fails to download Hive 2.2+ jars from maven

2019-06-27 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-28182:


 Summary: Spark fails to download Hive 2.2+ jars from maven
 Key: SPARK-28182
 URL: https://issues.apache.org/jira/browse/SPARK-28182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Emlyn Corrin


{{When starting Spark with spark.sql.hive.metastore.jars=maven and a 
spark.sql.hive.metastore.version of 2.2 or 2.3 it fails to download the 
required jars. It looks like it just downloads the -tests version of the 
zookeeper 3.4.6 jar:}}
{noformat}
> rm -rf ~/.ivy2
> pyspark --conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 
> --conf spark.sql.catalogImplementation=hive --conf 
> spark.sql.hive.metastore.jars=maven --conf 
> spark.sql.hive.metastore.version=2.3

Python 3.7.3 (default, Mar 27 2019, 09:23:39)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
19/06/27 12:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:39)
SparkSession available as 'spark'.

In [1]: spark.sql('show databases').show(){noformat}
 
{noformat}
http://www.datanucleus.org/downloads/maven2 added as a remote repository with 
the name: repo-1
Ivy Default Cache set to: /Users/emcorrin/.ivy2/cache
The jars for the packages stored in: /Users/emcorrin/.ivy2/jars
:: loading settings :: url = 
jar:file:/usr/local/Cellar/apache-spark/2.4.3/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hive#hive-metastore added as a dependency
org.apache.hive#hive-exec added as a dependency
org.apache.hive#hive-common added as a dependency
org.apache.hive#hive-serde added as a dependency
com.google.guava#guava added as a dependency
org.apache.hadoop#hadoop-client added as a dependency
:: resolving dependencies :: 
org.apache.spark#spark-submit-parent-639456d4-9614-45c9-ad3c-567e4fa69f79;1.0
 confs: [default]
 found org.apache.hive#hive-metastore;2.3.3 in central
 found org.apache.hive#hive-serde;2.3.3 in central
 found org.apache.hive#hive-common;2.3.3 in central
 found org.apache.hive#hive-shims;2.3.3 in central
 found org.apache.hive.shims#hive-shims-common;2.3.3 in central
 found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central
 found org.slf4j#slf4j-api;1.7.10 in central
 found com.google.guava#guava;14.0.1 in central
 found commons-lang#commons-lang;2.6 in central
 found org.apache.thrift#libthrift;0.9.3 in central
 found org.apache.httpcomponents#httpclient;4.4 in central
 found org.apache.httpcomponents#httpcore;4.4 in central
 found commons-logging#commons-logging;1.2 in central
 found commons-codec#commons-codec;1.4 in central
 found org.apache.zookeeper#zookeeper;3.4.6 in central
 found org.slf4j#slf4j-log4j12;1.6.1 in central
 found log4j#log4j;1.2.16 in central
 found jline#jline;2.12 in central
 found io.netty#netty;3.7.0.Final in central
 found org.apache.hive.shims#hive-shims-0.23;2.3.3 in central
 found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in central
 found org.apache.hadoop#hadoop-annotations;2.7.2 in central
 found com.google.inject.extensions#guice-servlet;3.0 in central
 found com.google.inject#guice;3.0 in central
 found javax.inject#javax.inject;1 in central
 found aopalliance#aopalliance;1.0 in central
 found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central
 found asm#asm;3.2 in central
 found com.google.protobuf#protobuf-java;2.5.0 in central
 found commons-io#commons-io;2.4 in central
 found com.sun.jersey#jersey-json;1.14 in central
 found org.codehaus.jettison#jettison;1.1 in central
 found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central
 found javax.xml.bind#jaxb-api;2.2.2 in central
 found javax.xml.stream#stax-api;1.0-2 in central
 found javax.activation#activation;1.1 in central
 found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
 found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
 found org.codehaus.jackson#jackson-jaxrs;1.9.13 in central
 found org.codehaus.jackson#jackson-xc;1.9.13 in central
 found com.sun.jersey.contribs#jersey-guice;1.9 in central
 found org.apache.hadoop#hadoop-yarn-common;2.7.2 in central
 found org.apache.hadoop#hadoop-yarn-api;2.7.2 in central
 found org.apache.commons#commons-compress;1.9 in central
 found org.mortbay.jetty#jetty-util;6.1.26 in central
 found com.sun.jersey#jersey-core;1.14 in central
 found com.sun.jersey#jersey-client;1.9 in central
 found 

[jira] [Updated] (SPARK-23179) Support option to throw exception if overflow occurs during Decimal arithmetic

2019-06-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23179:

Summary: Support option to throw exception if overflow occurs during 
Decimal arithmetic  (was: Support option to throw exception if overflow occurs)

> Support option to throw exception if overflow occurs during Decimal arithmetic
> --
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> SQL ANSI 2011 states that an exception should be thrown in case of overflow during 
> arithmetic operations. This is what most SQL databases do (e.g. SQL Server, DB2). 
> Hive currently returns NULL (as Spark does), but HIVE-18291 is open to make it 
> SQL compliant.
> I propose to add a config option which lets users decide whether Spark should 
> behave according to the SQL standard or in the current way (i.e. returning NULL).
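
A quick illustration of the current NULL-on-overflow behaviour described above, as a 
sketch (the literals are arbitrary and only chosen so the product exceeds 
DECIMAL(38,0)).

{code:scala}
// Multiplying two 20-digit decimals needs about 40 digits, which does not fit
// in DECIMAL(38, 0); with the current semantics the result is silently NULL.
spark.sql("""
  SELECT CAST('98765432109876543210' AS DECIMAL(20,0)) *
         CAST('98765432109876543210' AS DECIMAL(20,0)) AS product
""").show()
{code}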



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23179) Support option to throw exception if overflow occurs

2019-06-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23179.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20350
[https://github.com/apache/spark/pull/20350]

> Support option to throw exception if overflow occurs
> 
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> SQL ANSI 2011 states that an exception should be thrown in case of overflow during 
> arithmetic operations. This is what most SQL databases do (e.g. SQL Server, DB2). 
> Hive currently returns NULL (as Spark does), but HIVE-18291 is open to make it 
> SQL compliant.
> I propose to add a config option which lets users decide whether Spark should 
> behave according to the SQL standard or in the current way (i.e. returning NULL).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23179) Support option to throw exception if overflow occurs

2019-06-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23179:
---

Assignee: Marco Gaido

> Support option to throw exception if overflow occurs
> 
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
>
> SQL ANSI 2011 states that an exception should be thrown in case of overflow during 
> arithmetic operations. This is what most SQL databases do (e.g. SQL Server, DB2). 
> Hive currently returns NULL (as Spark does), but HIVE-18291 is open to make it 
> SQL compliant.
> I propose to add a config option which lets users decide whether Spark should 
> behave according to the SQL standard or in the current way (i.e. returning NULL).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28181) Add a filter interface to KVStore to speed up the entities retrieve

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28181:


Assignee: Apache Spark

> Add a filter interface to KVStore to speed up the entities retrieve
> ---
>
> Key: SPARK-28181
> URL: https://issues.apache.org/jira/browse/SPARK-28181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> Currently, entities in the KVStore can only be retrieved all or none. This ticket 
> adds a filter interface to the KVStore to speed up entity retrieval.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28181) Add a filter interface to KVStore to speed up the entities retrieve

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28181:


Assignee: (was: Apache Spark)

> Add a filter interface to KVStore to speed up the entities retrieve
> ---
>
> Key: SPARK-28181
> URL: https://issues.apache.org/jira/browse/SPARK-28181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Currently, entities in the KVStore can only be retrieved all or none. This ticket 
> adds a filter interface to the KVStore to speed up entity retrieval.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28181) Add a filter interface to KVStore to speed up the entities retrieve

2019-06-27 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-28181:
--

 Summary: Add a filter interface to KVStore to speed up the 
entities retrieve
 Key: SPARK-28181
 URL: https://issues.apache.org/jira/browse/SPARK-28181
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Lantao Jin


Currently, entities in the KVStore can only be retrieved all or none. This ticket 
adds a filter interface to the KVStore to speed up entity retrieval.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2019-06-27 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873905#comment-16873905
 ] 

Stavros Kontopoulos commented on SPARK-24417:
-

What about transitive dependencies? Don't they need to support Java 11 too? I 
spotted this one: [https://github.com/twitter/scrooge/pull/300], which is used by 
Twitter chill.

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28178:


Assignee: (was: Apache Spark)

> DataSourceV2: DataFrameWriter.insertInto
> 
>
> Key: SPARK-28178
> URL: https://issues.apache.org/jira/browse/SPARK-28178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * DataFrameWriter.insertInto("catalog.db.tbl")
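
A hedged usage sketch of the intended call shape ({{testcat}}, {{db}} and {{tbl}} are 
hypothetical names; the catalog itself would have to be registered separately, e.g. 
via a {{spark.sql.catalog.*}} setting).

{code:scala}
// Catalog-qualified insert through the DataFrameWriter API; "testcat" is an
// illustrative catalog name, not something this ticket defines, and the target
// table is assumed to already exist with a compatible schema.
val df = spark.range(10).toDF("id")
df.write.insertInto("testcat.db.tbl")
{code}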



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto

2019-06-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28178:


Assignee: Apache Spark

> DataSourceV2: DataFrameWriter.insertInto
> 
>
> Key: SPARK-28178
> URL: https://issues.apache.org/jira/browse/SPARK-28178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-27 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan updated SPARK-28180:

Description: 
I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an 
RDD of Enterprise objects with Encoders.bean(Entreprise.class); now it does the 
conversion more simply, by loading the CSV content into a Dataset<Row> and applying 
_Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
     .option("header", "true").option("quote", "\"").option("escape", "\"")
     .load(source.getAbsolutePath())
     .selectExpr(
        "ActivitePrincipaleUniteLegale as ActivitePrincipale",
        "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise",
        "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie",
        "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur",
        "CategorieEntreprise",
        "CategorieJuridiqueUniteLegale as CategorieJuridique",
        "DateCreationUniteLegale as DateCreationEntreprise",
        "DateDebut as DateDebutHistorisation",
        "DateDernierTraitementUniteLegale as DateDernierTraitement",
        "DenominationUniteLegale as Denomination",
        "DenominationUsuelle1UniteLegale as DenominationUsuelle1",
        "DenominationUsuelle2UniteLegale as DenominationUsuelle2",
        "DenominationUsuelle3UniteLegale as DenominationUsuelle3",
        "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire",
        "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
        "IdentifiantAssociationUniteLegale as IdentifiantAssociation",
        "NicSiegeUniteLegale as NicSiege",
        "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
        "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale",
        "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
        "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2",
        "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4",
        "PrenomUsuelUniteLegale as PrenomUsuel",
        "PseudonymeUniteLegale as Pseudonyme",
        "SexeUniteLegale as Sexe",
        "SigleUniteLegale as Sigle",
        "Siren",
        "TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
     );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Enterprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 

[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-27 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan updated SPARK-28180:

Description: 
I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an 
RDD of Enterprise objects with Encoders.bean(Entreprise.class); now it does the 
conversion more simply, by loading the CSV content into a Dataset<Row> and applying 
_Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
     .option("header", "true").option("quote", "\"").option("escape", "\"")
     .load(source.getAbsolutePath())
     .selectExpr(
        "ActivitePrincipaleUniteLegale as ActivitePrincipale",
        "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise",
        "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie",
        "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur",
        "CategorieEntreprise",
        "CategorieJuridiqueUniteLegale as CategorieJuridique",
        "DateCreationUniteLegale as DateCreationEntreprise",
        "DateDebut as DateDebutHistorisation",
        "DateDernierTraitementUniteLegale as DateDernierTraitement",
        "DenominationUniteLegale as Denomination",
        "DenominationUsuelle1UniteLegale as DenominationUsuelle1",
        "DenominationUsuelle2UniteLegale as DenominationUsuelle2",
        "DenominationUsuelle3UniteLegale as DenominationUsuelle3",
        "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire",
        "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
        "IdentifiantAssociationUniteLegale as IdentifiantAssociation",
        "NicSiegeUniteLegale as NicSiege",
        "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
        "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale",
        "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
        "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2",
        "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4",
        "PrenomUsuelUniteLegale as PrenomUsuel",
        "PseudonymeUniteLegale as Pseudonyme",
        "SexeUniteLegale as Sexe",
        "SigleUniteLegale as Sigle",
        "Siren",
        "TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
     );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Enterprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 

[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-27 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan updated SPARK-28180:

Description: 
I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an 
RDD of Enterprise objects with Encoders.bean(Entreprise.class); now it does the 
conversion more simply, by loading the CSV content into a Dataset<Row> and applying 
_Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
     .option("header", "true").option("quote", "\"").option("escape", "\"")
     .load(source.getAbsolutePath())
     .selectExpr(
        "ActivitePrincipaleUniteLegale as ActivitePrincipale",
        "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise",
        "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie",
        "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur",
        "CategorieEntreprise",
        "CategorieJuridiqueUniteLegale as CategorieJuridique",
        "DateCreationUniteLegale as DateCreationEntreprise",
        "DateDebut as DateDebutHistorisation",
        "DateDernierTraitementUniteLegale as DateDernierTraitement",
        "DenominationUniteLegale as Denomination",
        "DenominationUsuelle1UniteLegale as DenominationUsuelle1",
        "DenominationUsuelle2UniteLegale as DenominationUsuelle2",
        "DenominationUsuelle3UniteLegale as DenominationUsuelle3",
        "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire",
        "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
        "IdentifiantAssociationUniteLegale as IdentifiantAssociation",
        "NicSiegeUniteLegale as NicSiege",
        "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
        "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale",
        "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
        "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2",
        "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4",
        "PrenomUsuelUniteLegale as PrenomUsuel",
        "PseudonymeUniteLegale as Pseudonyme",
        "SexeUniteLegale as Sexe",
        "SigleUniteLegale as Sigle",
        "Siren",
        "TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
     );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Enterprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 

[jira] [Created] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-27 Thread M. Le Bihan (JIRA)
M. Le Bihan created SPARK-28180:
---

 Summary: Encoding CSV to Pojo works with Encoders.bean on RDD but 
fails on asserts when attempting it from a Dataset
 Key: SPARK-28180
 URL: https://issues.apache.org/jira/browse/SPARK-28180
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
 Environment: Debian 9, Java 8.
Reporter: M. Le Bihan


I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an 
RDD of Enterprise objects with Encoders.bean(Entreprise.class); now it does the 
conversion more simply, by loading the CSV content into a Dataset<Row> and applying 
_Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
     .option("header", "true").option("quote", "\"").option("escape", "\"")
     .load(source.getAbsolutePath())
     .selectExpr(
        "ActivitePrincipaleUniteLegale as ActivitePrincipale",
        "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise",
        "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie",
        "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur",
        "CategorieEntreprise",
        "CategorieJuridiqueUniteLegale as CategorieJuridique",
        "DateCreationUniteLegale as DateCreationEntreprise",
        "DateDebut as DateDebutHistorisation",
        "DateDernierTraitementUniteLegale as DateDernierTraitement",
        "DenominationUniteLegale as Denomination",
        "DenominationUsuelle1UniteLegale as DenominationUsuelle1",
        "DenominationUsuelle2UniteLegale as DenominationUsuelle2",
        "DenominationUsuelle3UniteLegale as DenominationUsuelle3",
        "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire",
        "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
        "IdentifiantAssociationUniteLegale as IdentifiantAssociation",
        "NicSiegeUniteLegale as NicSiege",
        "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
        "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale",
        "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
        "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2",
        "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4",
        "PrenomUsuelUniteLegale as PrenomUsuel",
        "PseudonymeUniteLegale as Pseudonyme",
        "SexeUniteLegale as Sexe",
        "SigleUniteLegale as Sigle",
        "Siren",
        "TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
     );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Enterprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at