[jira] [Assigned] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28196: Assignee: Apache Spark > Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog > --- > > Key: SPARK-28196 > URL: https://issues.apache.org/jira/browse/SPARK-28196 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28196: Assignee: (was: Apache Spark) > Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog > --- > > Key: SPARK-28196 > URL: https://issues.apache.org/jira/browse/SPARK-28196 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28196: Description: {code:scala} def listTables(db: String, pattern: String, includeLocalTempViews: Boolean): Seq[TableIdentifier] def listLocalTempViews(pattern: String): Seq[TableIdentifier] {code} Because in some cases {{listTables}} does not need to include local temporary views, and in other cases we only need to list local temporary views. > Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog > --- > > Key: SPARK-28196 > URL: https://issues.apache.org/jira/browse/SPARK-28196 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > > {code:scala} > def listTables(db: String, pattern: String, includeLocalTempViews: Boolean): > Seq[TableIdentifier] > def listLocalTempViews(pattern: String): Seq[TableIdentifier] > {code} > Because in some cases {{listTables}} does not need to include local temporary views, and > in other cases we only need to list local temporary views. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
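For context, a minimal sketch of how the proposed methods might be used, assuming they land on {{SessionCatalog}} with the signatures above ({{SessionCatalog}} is an internal Catalyst API, so this would only compile inside Spark's own modules):

{code:scala}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

object ListTablesSketch {
  // Assumes the two proposed methods above exist on SessionCatalog.
  def demo(catalog: SessionCatalog): Unit = {
    // Existing API: listTables(db, pattern) mixes local temp views into the result.
    val everything: Seq[TableIdentifier] = catalog.listTables("default", "*")

    // Proposed overload: skip local temp views when only persistent tables are wanted.
    val tablesOnly = catalog.listTables("default", "*", includeLocalTempViews = false)

    // Proposed new method: list local temp views alone.
    val tempViewsOnly = catalog.listLocalTempViews("*")

    println(s"${everything.size} entries, ${tablesOnly.size} tables, ${tempViewsOnly.size} local temp views")
  }
}
{code}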
[jira] [Updated] (SPARK-28196) Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28196: Summary: Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog (was: SessionCatalog#listTables support does not list local temporary views) > Add a new `listTables` and `listLocalTempViews` APIs for SessionCatalog > --- > > Key: SPARK-28196 > URL: https://issues.apache.org/jira/browse/SPARK-28196 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28196) SessionCatalog#listTables support does not list local temporary views
Yuming Wang created SPARK-28196: --- Summary: SessionCatalog#listTables support does not list local temporary views Key: SPARK-28196 URL: https://issues.apache.org/jira/browse/SPARK-28196 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28133) Hyperbolic Functions
[ https://issues.apache.org/jira/browse/SPARK-28133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28133: Assignee: Apache Spark > Hyperbolic Functions > > > Key: SPARK-28133 > URL: https://issues.apache.org/jira/browse/SPARK-28133 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > ||Function||Description||Example||Result|| > |{{sinh(_x_)}}|hyperbolic sine|{{sinh(0)}}|{{0}}| > |{{cosh(_x_)}}|hyperbolic cosine|{{cosh(0)}}|{{1}}| > |{{tanh(_x_)}}|hyperbolic tangent|{{tanh(0)}}|{{0}}| > |{{asinh(_x_)}}|inverse hyperbolic sine|{{asinh(0)}}|{{0}}| > |{{acosh(_x_)}}|inverse hyperbolic cosine|{{acosh(1)}}|{{0}}| > |{{atanh(_x_)}}|inverse hyperbolic tangent|{{atanh(0)}}|{{0}}| > > > [https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-HYP-TABLE] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28133) Hyperbolic Functions
[ https://issues.apache.org/jira/browse/SPARK-28133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28133: Assignee: (was: Apache Spark) > Hyperbolic Functions > > > Key: SPARK-28133 > URL: https://issues.apache.org/jira/browse/SPARK-28133 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Description||Example||Result|| > |{{sinh(_x_)}}|hyperbolic sine|{{sinh(0)}}|{{0}}| > |{{cosh(_x_)}}|hyperbolic cosine|{{cosh(0)}}|{{1}}| > |{{tanh(_x_)}}|hyperbolic tangent|{{tanh(0)}}|{{0}}| > |{{asinh(_x_)}}|inverse hyperbolic sine|{{asinh(0)}}|{{0}}| > |{{acosh(_x_)}}|inverse hyperbolic cosine|{{acosh(1)}}|{{0}}| > |{{atanh(_x_)}}|inverse hyperbolic tangent|{{atanh(0)}}|{{0}}| > > > [https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-HYP-TABLE] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
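The table above maps directly onto SQL: {{sinh}}, {{cosh}} and {{tanh}} already exist in Spark SQL, while {{asinh}}, {{acosh}} and {{atanh}} are the additions tracked here. A sketch that checks the expected results, assuming the new functions are in place:

{code:scala}
import org.apache.spark.sql.SparkSession

object HyperbolicSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("hyperbolic-sketch").getOrCreate()
    // Expected row, per the table above: 0.0, 1.0, 0.0, 0.0, 0.0, 0.0
    spark.sql("SELECT sinh(0), cosh(0), tanh(0), asinh(0), acosh(1), atanh(0)").show()
    spark.stop()
  }
}
{code}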
[jira] [Updated] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bianqi updated SPARK-17398: --- Attachment: screenshot-1.png > Failed to query on external JSon Partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang >Priority: Major > Fix For: 2.0.1 > > Attachments: screenshot-1.png > > > 1. Create External Json partitioned table > with SerDe in hive-hcatalog-core-1.2.1.jar, download fom > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Query table meet exception, which works in spark1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3. Test Code > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.hive.HiveContext > object JsonBugs { > def main(args: Array[String]): Unit = { > val table = "test_json" > val location = "file:///g:/home/test/json" > val create = s"""CREATE EXTERNAL TABLE ${table} > (id string, seq string ) > PARTITIONED BY(index int) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > LOCATION "${location}" > """ > val add_part = s""" > ALTER TABLE ${table} ADD > PARTITION (index=1)LOCATION '${location}/index=1' > """ > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") > val ctx = new SparkContext(conf) > val hctx = new HiveContext(ctx) > val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table) > if (!exist) { > hctx.sql(create) > hctx.sql(add_part) > } else { > hctx.sql("show partitions " + table).show() > } > hctx.sql("select * from test_json").show() > } > } -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874652#comment-16874652 ] bianqi commented on SPARK-17398: !screenshot-1.png! > Failed to query on external JSon Partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang >Priority: Major > Fix For: 2.0.1 > > Attachments: screenshot-1.png > > > 1. Create External Json partitioned table > with SerDe in hive-hcatalog-core-1.2.1.jar, download fom > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Query table meet exception, which works in spark1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3. Test Code > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.hive.HiveContext > object JsonBugs { > def main(args: Array[String]): Unit = { > val table = "test_json" > val location = "file:///g:/home/test/json" > val create = s"""CREATE EXTERNAL TABLE ${table} > (id string, seq string ) > PARTITIONED BY(index int) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > LOCATION "${location}" > """ > val add_part = s""" > ALTER TABLE ${table} ADD > PARTITION (index=1)LOCATION '${location}/index=1' > """ > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") > val ctx = new SparkContext(conf) > val hctx = new HiveContext(ctx) > val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table) > if (!exist) { > hctx.sql(create) > hctx.sql(add_part) > } else { > hctx.sql("show partitions " + table).show() > } > hctx.sql("select * from test_json").show() > } > } -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
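For readability, here is the reproduction code quoted in the report above with its original line breaks restored (content unchanged; it targets the old {{HiveContext}} API the reporter was using):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object JsonBugs {
  def main(args: Array[String]): Unit = {
    val table = "test_json"
    val location = "file:///g:/home/test/json"
    val create = s"""CREATE EXTERNAL TABLE ${table}
        (id string, seq string )
        PARTITIONED BY(index int)
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        LOCATION "${location}"
        """
    val add_part = s"""
        ALTER TABLE ${table} ADD
        PARTITION (index=1) LOCATION '${location}/index=1'
        """
    val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
    conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
    val ctx = new SparkContext(conf)
    val hctx = new HiveContext(ctx)
    val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table)
    if (!exist) {
      hctx.sql(create)
      hctx.sql(add_part)
    } else {
      hctx.sql("show partitions " + table).show()
    }
    hctx.sql("select * from test_json").show()
  }
}
{code}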
[jira] [Assigned] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
[ https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28194: Assignee: Apache Spark > [SQL] A NoSuchElementException maybe thrown when EnsureRequirement > -- > > Key: SPARK-28194 > URL: https://issues.apache.org/jira/browse/SPARK-28194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: feiwang >Assignee: Apache Spark >Priority: Major > > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
[ https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28194: Assignee: (was: Apache Spark) > [SQL] A NoSuchElementException maybe thrown when EnsureRequirement > -- > > Key: SPARK-28194 > URL: https://issues.apache.org/jira/browse/SPARK-28194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28195) CheckAnalysis not working for Command and report misleading error message
[ https://issues.apache.org/jira/browse/SPARK-28195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-28195: - Description: Currently, we encountered an issue when executing `InsertIntoDataSourceDirCommand`, and we found that its query relied on a non-existent table or view, but we got a misleading error message: {code:java} Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'kr.objective_id at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159) at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544) at org.apache.spark.sql.execution.command.InsertIntoDataSourceDirCommand.run(InsertIntoDataSourceDirCommand.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:246) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3277) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3276) at org.apache.spark.sql.Dataset.init(Dataset.scala:190) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:277) ... 11 more {code} After looking into the code, I found that it's because we have supported the `runSQLOnFiles` feature since 2.3: if the table does not exist and it's not a temporary table, it will be treated as a query running directly on files. The `ResolveSQLOnFile` rule will analyze it and return an `UnresolvedRelation` on resolution failure (it's actually not SQL on files, so it will fail to resolve). Because Command has no children, `CheckAnalysis` skips checking the `UnresolvedRelation`, and we end up with the misleading error message above when executing this command. I think maybe we should run checkAnalysis on a command's query plan. Or is there a reason for not checking analysis for commands? This issue seems to still exist in the master branch. 
was: Currently, we encountered an issue when executing `InsertIntoDataSourceDirCommand`, and we found that it's query relied on non-exist table or view, but we finally got a misleading error message: {code:java} Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'kr.objective_id at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159) at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544) at
[jira] [Created] (SPARK-28195) CheckAnalysis not working for Command and report misleading error message
liupengcheng created SPARK-28195: Summary: CheckAnalysis not working for Command and report misleading error message Key: SPARK-28195 URL: https://issues.apache.org/jira/browse/SPARK-28195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: liupengcheng Currently, we encountered an issue when executing `InsertIntoDataSourceDirCommand`, and we found that it's query relied on non-exist table or view, but we finally got a misleading error message: {code:java} Caused by: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'kr.objective_id at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:440) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:440) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:159) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:159) at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:544) at org.apache.spark.sql.execution.command.InsertIntoDataSourceDirCommand.run(InsertIntoDataSourceDirCommand.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.execution.adaptive.QueryStage.executeCollect(QueryStage.scala:246) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3277) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3276) at org.apache.spark.sql.Dataset.init(Dataset.scala:190) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:277) ... 11 more {code} After looking into the code, I found that it's because we support `runSQLOnFiles` feature since 2.3, and if the table does not exist and it's not a temporary table, then It will be treated as running directly on files. `ResolveSQLOnFile` rule will analyze it, and return an `UnresolvedRelation` on resolve failure(it's actually not a sql on files, so it will fail when resolving). Due to Command has empty children, `CheckAnalysis` will skip check the `UnresolvedRelation` and finally we got the above misleading error message when executing this command. I think maybe we should checkAnalysis for command's query plan? 
Or is there any consideration for not checking analysis for command? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
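To make the failure mode concrete, here is a hypothetical repro of the scenario described above (table and column names are made up): an {{INSERT OVERWRITE DIRECTORY}} command whose query references a table that does not exist, so the unresolved plan slips past {{CheckAnalysis}} and only fails at execution time with the {{UnresolvedException}} rather than a "table or view not found" error:

{code:scala}
import org.apache.spark.sql.SparkSession

object CheckAnalysisRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("spark-28195-repro")
      .getOrCreate()

    // `missing_table` intentionally does not exist. Because the statement is a Command
    // with no children, analysis appears to succeed and the misleading error only
    // surfaces when the command runs.
    spark.sql(
      """INSERT OVERWRITE DIRECTORY '/tmp/spark-28195-out'
        |USING parquet
        |SELECT kr.objective_id FROM missing_table kr
        |""".stripMargin)

    spark.stop()
  }
}
{code}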
[jira] [Updated] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
[ https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-28194: Description: {code:java} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) at scala.None$.get(Option.scala:345) at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239) at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234) at org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257) at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297) at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312) at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) {code} > [SQL] A NoSuchElementException maybe thrown when EnsureRequirement > -- > > Key: SPARK-28194 > URL: https://issues.apache.org/jira/browse/SPARK-28194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:239) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$reorder$1.apply(EnsureRequirements.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:234) > at > 
org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:257) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:297) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at >
[jira] [Updated] (SPARK-28194) [SQL] A NoSuchElementException maybe thrown when EnsureRequirement
[ https://issues.apache.org/jira/browse/SPARK-28194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-28194: Summary: [SQL] A NoSuchElementException maybe thrown when EnsureRequirement (was: a NoSuchElementException maybe thrown when EnsureRequirement) > [SQL] A NoSuchElementException maybe thrown when EnsureRequirement > -- > > Key: SPARK-28194 > URL: https://issues.apache.org/jira/browse/SPARK-28194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28194) a NoSuchElementException maybe thrown when EnsureRequirement
feiwang created SPARK-28194: --- Summary: a NoSuchElementException maybe thrown when EnsureRequirement Key: SPARK-28194 URL: https://issues.apache.org/jira/browse/SPARK-28194 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: feiwang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase
[ https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28179: Assignee: Yuming Wang > Avoid hard-coded config: spark.sql.globalTempDatabase > - > > Key: SPARK-28179 > URL: https://issues.apache.org/jira/browse/SPARK-28179 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase
[ https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28179. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24979 [https://github.com/apache/spark/pull/24979] > Avoid hard-coded config: spark.sql.globalTempDatabase > - > > Key: SPARK-28179 > URL: https://issues.apache.org/jira/browse/SPARK-28179 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28193) toPandas() not working as expected in Apache Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-28193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874597#comment-16874597 ] Hyukjin Kwon commented on SPARK-28193: -- Seems like Arrow feature was enabled. Can you try it after setting {{spark.sql.execution.arrow.fallback.enabled}} false and share the error message? > toPandas() not working as expected in Apache Spark 2.4.0 > > > Key: SPARK-28193 > URL: https://issues.apache.org/jira/browse/SPARK-28193 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.2 > Environment: Databricks 5.3 Apache Spark 2.4.0 >Reporter: SUSHMIT ROY >Priority: Minor > > I am in a databricks environment and using Pyspark2.4.0 but still, the > topandas is taking a lot of time. Any ideas on what can be causing the issue? > Also, I am getting a warning like {{UserWarning: pyarrow.open_stream is > deprecated, please use pyarrow.ipc.open_stream,although I have upgraded to > pyarrows0.13.0}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28188) Materialize Dataframe API
[ https://issues.apache.org/jira/browse/SPARK-28188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28188: Assignee: Apache Spark > Materialize Dataframe API > -- > > Key: SPARK-28188 > URL: https://issues.apache.org/jira/browse/SPARK-28188 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Vinitha Reddy Gankidi >Assignee: Apache Spark >Priority: Major > > We have added a new API to materialize dataframes and our internal users have > found it very useful. For use cases where you need to do different > computations on the same dataframe, Spark recomputes the dataframe each time. > This is problematic if evaluation of the dataframe is expensive. > Materialize is a Spark action. It is a way to let Spark explicitly know that > the dataframe has already been computed. Once a dataframe is materialized, > Spark skips all stages prior to the materialize when the dataframe is reused > later on. > Spark may scan the same table twice if two queries load different columns. > For example, the following two queries would scan the same data twice: > {code:java} > val tab = spark.table("some_table").filter("c LIKE '%match%'") > val num_groups = tab.agg(distinctCount($"a")) > val groups_with_b = tab.groupBy($"a").agg(min($"b") as "min"){code} > > The same table is scanned twice because Spark doesn't know it should load b > when the first query runs. You can use materialize to load and then reuse the > data: > {code:java} > val materialized = spark.table("some_table").filter("c LIKE '%match%'") > .select($"a", $"b").repartition($"a").materialize() > val num_groups = materialized.agg(distinctCount($"a")) > val groups_with_b = materialized.groupBy($"a").agg(min($"b") as "min"){code} > > This uses select to filter out columns that don't need to be loaded. Without > this, Spark doesn't know that only a and b are going to be used later. > This example also uses repartition to add a shuffle because Spark resumes > from the last shuffle. In most cases you may need to repartition the > dataframe before materializing it in order to skip the expensive stages as > repartition introduces a new stage. > h3. Materialize vs Cache: > * Caching/Persisting of dataframes is lazy. The first time the dataset is > computed in an action, it will be kept in memory on the nodes. Materialize is > an action that runs a job that produces the rows of data that a data frame > represents, and returns a new data frame with the result. When the result > data frame is used, Spark resumes execution using the data from the last > shuffle. > * By reusing shuffle data, materialized data is served by the cluster's > persistent shuffle servers instead of Spark executors. This makes materialize > more reliable. Caching on the other hand happens in the executor where the > task runs and data could be lost if executors time out from inactivity or run > out of memory. > * Since materialize is more reliable and uses fewer resources than cache, it > is usually a better choice for batch workloads. But, for processing that > iterates over a dataset many times, it is better to keep the data in memory > using cache or persist. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28188) Materialize Dataframe API
[ https://issues.apache.org/jira/browse/SPARK-28188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28188: Assignee: (was: Apache Spark) > Materialize Dataframe API > -- > > Key: SPARK-28188 > URL: https://issues.apache.org/jira/browse/SPARK-28188 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Vinitha Reddy Gankidi >Priority: Major > > We have added a new API to materialize dataframes and our internal users have > found it very useful. For use cases where you need to do different > computations on the same dataframe, Spark recomputes the dataframe each time. > This is problematic if evaluation of the dataframe is expensive. > Materialize is a Spark action. It is a way to let Spark explicitly know that > the dataframe has already been computed. Once a dataframe is materialized, > Spark skips all stages prior to the materialize when the dataframe is reused > later on. > Spark may scan the same table twice if two queries load different columns. > For example, the following two queries would scan the same data twice: > {code:java} > val tab = spark.table("some_table").filter("c LIKE '%match%'") > val num_groups = tab.agg(distinctCount($"a")) > val groups_with_b = tab.groupBy($"a").agg(min($"b") as "min"){code} > > The same table is scanned twice because Spark doesn't know it should load b > when the first query runs. You can use materialize to load and then reuse the > data: > {code:java} > val materialized = spark.table("some_table").filter("c LIKE '%match%'") > .select($"a", $"b").repartition($"a").materialize() > val num_groups = materialized.agg(distinctCount($"a")) > val groups_with_b = materialized.groupBy($"a").agg(min($"b") as "min"){code} > > This uses select to filter out columns that don't need to be loaded. Without > this, Spark doesn't know that only a and b are going to be used later. > This example also uses repartition to add a shuffle because Spark resumes > from the last shuffle. In most cases you may need to repartition the > dataframe before materializing it in order to skip the expensive stages as > repartition introduces a new stage. > h3. Materialize vs Cache: > * Caching/Persisting of dataframes is lazy. The first time the dataset is > computed in an action, it will be kept in memory on the nodes. Materialize is > an action that runs a job that produces the rows of data that a data frame > represents, and returns a new data frame with the result. When the result > data frame is used, Spark resumes execution using the data from the last > shuffle. > * By reusing shuffle data, materialized data is served by the cluster's > persistent shuffle servers instead of Spark executors. This makes materialize > more reliable. Caching on the other hand happens in the executor where the > task runs and data could be lost if executors time out from inactivity or run > out of memory. > * Since materialize is more reliable and uses fewer resources than cache, it > is usually a better choice for batch workloads. But, for processing that > iterates over a dataset many times, it is better to keep the data in memory > using cache or persist. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
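For comparison with the proposal above, the closest workaround available today is caching plus an eager action. A rough sketch (reusing the example table from the description and the real {{countDistinct}} function from {{org.apache.spark.sql.functions}}), which keeps the reused data in executor memory/disk rather than in shuffle files:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{countDistinct, min}
import org.apache.spark.storage.StorageLevel

object CacheWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cache-vs-materialize").getOrCreate()
    import spark.implicits._

    val tab = spark.table("some_table").filter("c LIKE '%match%'").select($"a", $"b")

    // cache/persist is lazy: the data is only pinned on executors once an action runs.
    tab.persist(StorageLevel.MEMORY_AND_DISK)
    tab.count()

    val numGroups = tab.agg(countDistinct($"a"))
    val groupsWithB = tab.groupBy($"a").agg(min($"b") as "min")

    numGroups.show()
    groupsWithB.show()
    spark.stop()
  }
}
{code}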
[jira] [Assigned] (SPARK-28187) Add hadoop-cloud module to PR builders
[ https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-28187: -- Assignee: Marcelo Vanzin > Add hadoop-cloud module to PR builders > -- > > Key: SPARK-28187 > URL: https://issues.apache.org/jira/browse/SPARK-28187 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > > We currently don't build / test the hadoop-cloud stuff in PRs. See > https://github.com/apache/spark/pull/24970 for an example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28187) Add hadoop-cloud module to PR builders
[ https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-28187. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24987 [https://github.com/apache/spark/pull/24987] > Add hadoop-cloud module to PR builders > -- > > Key: SPARK-28187 > URL: https://issues.apache.org/jira/browse/SPARK-28187 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 3.0.0 > > > We currently don't build / test the hadoop-cloud stuff in PRs. See > https://github.com/apache/spark/pull/24970 for an example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28192) Data Source - State - Write side
[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874538#comment-16874538 ] Jungtaek Lim edited comment on SPARK-28192 at 6/27/19 10:07 PM: I realized new DSv2 (maybe old DSv2 too?) requires Dataframe to be partitioned correctly before putting sink. State writer is not the case, and unfortunately there's no storage coordinating this. It should repartition via key by itself, which could be possible with DSv1 (since it provides Dataframe to write) but no longer possible with DSv2. [https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75] [~rdblue] [~cloud_fan] Which would be the best to address this? Would I need to wrap this with some method to handle repartition before adding to sink? was (Author: kabhwan): I realized new DSv2 (maybe old DSv2 too?) requires Dataframe to be partitioned correctly before putting sink. State writer is not the case, as there's no storage coordinating this. It should repartition via key by itself, which could be possible with DSv1 (since it provides Dataframe to write) but no longer possible with DSv2. [https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75] [~rdblue] [~cloud_fan] Which would be the best to address this? Would I need to wrap this with some method to handle repartition before adding to sink? > Data Source - State - Write side > > > Key: SPARK-28192 > URL: https://issues.apache.org/jira/browse/SPARK-28192 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the efforts on addressing batch write on state data source. > It could include "state repartition" if it doesn't require huge effort for > new DSv2, but it can be also move out to separate issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
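What the linked {{StateStoreWriter}} does under DSv1 amounts to repartitioning the incoming Dataframe by the state key before the write, so that each state-store partition is produced by a single task. A sketch of that shape (the {{"state"}} format name and the {{key}} column are placeholders, not an existing Spark API):

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

object StateWriteSketch {
  // Placeholder names: "state" stands in for the proposed source's short name,
  // and keyCols for the grouping key columns of the stateful operator.
  def writeState(df: DataFrame, checkpointLocation: String, numStatePartitions: Int): Unit = {
    val keyCols = Seq("key")
    val repartitioned = df.repartition(numStatePartitions, keyCols.map(df(_)): _*)

    repartitioned.write
      .format("state")
      .option("checkpointLocation", checkpointLocation)
      .mode(SaveMode.Append)
      .save()
  }
}
{code}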
[jira] [Created] (SPARK-28193) toPandas() not working as expected in Apache Spark 2.4.0
SUSHMIT ROY created SPARK-28193: --- Summary: toPandas() not working as expected in Apache Spark 2.4.0 Key: SPARK-28193 URL: https://issues.apache.org/jira/browse/SPARK-28193 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.2 Environment: Databricks 5.3 Apache Spark 2.4.0 Reporter: SUSHMIT ROY I am in a Databricks environment using PySpark 2.4.0, but {{toPandas()}} is still taking a lot of time. Any ideas on what could be causing the issue? Also, I am getting a warning like {{UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream}}, although I have upgraded to pyarrow 0.13.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28192) Data Source - State - Write side
[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874538#comment-16874538 ] Jungtaek Lim commented on SPARK-28192: -- I realized new DSv2 (maybe old DSv2 too?) requires Dataframe to be partitioned correctly before putting sink. State writer is not the case, as there's no storage coordinating this. It should repartition via key by itself, which could be possible with DSv1 (since it provides Dataframe to write) but no longer possible with DSv2. [https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75] [~rdblue] [~cloud_fan] Which would be the best to address this? Would I need to wrap this with some method to handle repartition before adding to sink? > Data Source - State - Write side > > > Key: SPARK-28192 > URL: https://issues.apache.org/jira/browse/SPARK-28192 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the efforts on addressing batch write on state data source. > It could include "state repartition" if it doesn't require huge effort for > new DSv2, but it can be also move out to separate issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28191) Data Source - State - Read side
[ https://issues.apache.org/jira/browse/SPARK-28191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28191: Assignee: (was: Apache Spark) > Data Source - State - Read side > --- > > Key: SPARK-28191 > URL: https://issues.apache.org/jira/browse/SPARK-28191 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the efforts on addressing batch read on state data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28191) Data Source - State - Read side
[ https://issues.apache.org/jira/browse/SPARK-28191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28191: Assignee: Apache Spark > Data Source - State - Read side > --- > > Key: SPARK-28191 > URL: https://issues.apache.org/jira/browse/SPARK-28191 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > This issue tracks the efforts on addressing batch read on state data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27959) Change YARN resource configs to use .amount
[ https://issues.apache.org/jira/browse/SPARK-27959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27959: Assignee: (was: Apache Spark) > Change YARN resource configs to use .amount > --- > > Key: SPARK-27959 > URL: https://issues.apache.org/jira/browse/SPARK-27959 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > We are adding generic resource support into Spark where we have a suffix for > the amount of the resource so that we could support other configs. > Spark on YARN has already added configs to request resources via the configs > spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is the value and unit together. We should change those configs to have a > .amount suffix on them to match the Spark configs and to allow future configs > to be more easily added. YARN itself already supports tags and attributes, so > if we want the user to be able to pass those from Spark at some point, having > a suffix makes sense. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27959) Change YARN resource configs to use .amount
[ https://issues.apache.org/jira/browse/SPARK-27959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27959: Assignee: Apache Spark > Change YARN resource configs to use .amount > --- > > Key: SPARK-27959 > URL: https://issues.apache.org/jira/browse/SPARK-27959 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Major > > We are adding generic resource support into Spark where we have a suffix for > the amount of the resource so that we could support other configs. > Spark on YARN has already added configs to request resources via the configs > spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is the value and unit together. We should change those configs to have a > .amount suffix on them to match the Spark configs and to allow future configs > to be more easily added. YARN itself already supports tags and attributes, so > if we want the user to be able to pass those from Spark at some point, having > a suffix makes sense. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
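To illustrate the rename, a hedged before/after sketch of requesting a YARN-level resource from application code (the exact key names assume the proposed {{.amount}} suffix):

{code:scala}
import org.apache.spark.SparkConf

object YarnResourceConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()

    // Old style described above: the amount (value and unit) packed directly into the config value.
    // conf.set("spark.yarn.executor.resource.gpu", "2")

    // Proposed style: an explicit .amount suffix, matching the generic resource configs.
    conf.set("spark.yarn.executor.resource.gpu.amount", "2")
    conf.set("spark.yarn.driver.resource.gpu.amount", "1")

    println(conf.getAll.mkString("\n"))
  }
}
{code}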
[jira] [Created] (SPARK-28192) Data Source - State - Write side
Jungtaek Lim created SPARK-28192: Summary: Data Source - State - Write side Key: SPARK-28192 URL: https://issues.apache.org/jira/browse/SPARK-28192 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim This issue tracks the efforts on addressing batch write on state data source. It could include "state repartition" if it doesn't require huge effort for new DSv2, but it can be also move out to separate issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28191) Data Source - State - Read side
Jungtaek Lim created SPARK-28191: Summary: Data Source - State - Read side Key: SPARK-28191 URL: https://issues.apache.org/jira/browse/SPARK-28191 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim This issue tracks the efforts on addressing batch read on state data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28190) Data Source - State
[ https://issues.apache.org/jira/browse/SPARK-28190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874522#comment-16874522 ] Jungtaek Lim commented on SPARK-28190: -- While I'll create couple of sub-issues soon, please also let me know if we would like to apply SPIP process for this. Thanks for your interest on this! > Data Source - State > --- > > Key: SPARK-28190 > URL: https://issues.apache.org/jira/browse/SPARK-28190 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > "State" is becoming one of most important data on most of streaming > frameworks, which makes us getting continuous result of the query. In other > words, query could be no longer valid once state is corrupted or lost. > Ideally we could run the query from the first of data to construct a > brand-new state for current query, but in reality it may not be possible for > many reasons, like input data source having retention, lots of resource waste > to rerun from start, etc. > > There're other cases which end users want to deal with state, like creating > initial state from existing data via batch query (given batch query could be > far more efficient and faster). > I'd like to propose a new data source which handles "state" in batch query, > enabling read and write on state. > Allowing state read brings couple of benefits: > * You can analyze the state from "outside" of your streaming query > * It could be useful when there's something which can be derived from > existing state of existing query - note that state is not designed to be > shared among multiple queries > Allowing state (re)write brings couple of major benefits: > * State can be repartitioned physically > * Schema in state can be changed, which means you don't need to run the > query from the start when the query should be changed > * You can remove state rows if you want, like reducing size, removing > corrupt, etc. > * You can bootstrap state in your new query with existing data efficiently, > don't need to run streaming query from the start point > Btw, basically I'm planning to contribute my own works > ([https://github.com/HeartSaVioR/spark-state-tools]), so for many of > sub-issues it would require not-too-much amount of efforts to submit patches. > I'll try to apply new DSv2, so it could be a major effort while preparing to > donate code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28190) Data Source - State
Jungtaek Lim created SPARK-28190: Summary: Data Source - State Key: SPARK-28190 URL: https://issues.apache.org/jira/browse/SPARK-28190 Project: Spark Issue Type: Umbrella Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim "State" is becoming one of the most important pieces of data in most streaming frameworks: it is what lets a query keep producing continuous results. In other words, a query may no longer be valid once its state is corrupted or lost. Ideally we could rerun the query from the beginning of the data to construct a brand-new state for the current query, but in practice that is often not possible for many reasons, such as retention on the input data source, the large amount of resources wasted by rerunning from the start, etc. There are other cases where end users want to work with state, like creating an initial state from existing data via a batch query (given that a batch query can be far more efficient and faster). I'd like to propose a new data source which handles "state" in batch queries, enabling both read and write on state. Allowing state reads brings a couple of benefits: * You can analyze the state from "outside" of your streaming query * It can be useful when something can be derived from the existing state of an existing query - note that state is not designed to be shared among multiple queries Allowing state (re)writes brings a couple of major benefits: * State can be repartitioned physically * The schema of the state can be changed, which means you don't need to rerun the query from the start when the query has to change * You can remove state rows if you want, e.g. to reduce size or drop corrupt rows * You can bootstrap the state of a new query with existing data efficiently, without running the streaming query from the starting point By the way, I'm basically planning to contribute my own work ([https://github.com/HeartSaVioR/spark-state-tools]), so for many of the sub-issues it should not take too much effort to submit patches. I'll try to adopt the new DSv2, which could be a major effort while preparing to donate the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
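To make the proposal above a bit more concrete, the sketch below shows what batch reads and writes over state could look like from PySpark. It is purely illustrative: the format name "statestore", the option names, the checkpoint paths, and the "key" column are all assumptions invented for this example; they are not the API of Spark or of spark-state-tools.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical read side: load the state of one stateful operator from a
# query checkpoint as an ordinary batch DataFrame (names are assumptions).
state = (spark.read
         .format("statestore")
         .option("checkpointLocation", "/checkpoints/my_query")
         .option("operatorId", 0)
         .load())

# Analyze the state from "outside" of the streaming query.
state.groupBy("key").count().show()

# Hypothetical write side: rewrite the state, e.g. to repartition it
# physically or to bootstrap the state of a new query.
(state.repartition(10, "key")
 .write
 .format("statestore")
 .option("checkpointLocation", "/checkpoints/my_query_v2")
 .option("operatorId", 0)
 .save())
{code}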
[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian
[ https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26985: Labels: BigEndian correctness (was: BigEndian) > Test "access only some column of the all of columns " fails on big endian > - > > Key: SPARK-26985 > URL: https://issues.apache.org/jira/browse/SPARK-26985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Linux Ubuntu 16.04 > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed > References 20190205_218 (JIT enabled, AOT enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) > >Reporter: Anuja Jakhade >Assignee: ketan kunde >Priority: Major > Labels: BigEndian, correctness > Fix For: 3.0.0 > > Attachments: DataFrameTungstenSuite.txt, > InMemoryColumnarQuerySuite.txt, access only some column of the all of > columns.txt > > > While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am > observing test failures for 2 Suites of Project SQL. > 1. InMemoryColumnarQuerySuite > 2. DataFrameTungstenSuite > In both the cases test "access only some column of the all of columns" fails > due to mismatch in the final assert. > Observed that the data obtained after df.cache() is causing the error. Please > find attached the log with the details. > cache() works perfectly fine if double and float values are not in picture. > Inside test !!- access only some column of the all of columns *** FAILED > *** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke updated SPARK-28189: - Description: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop('caps') # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. was: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Luke >Priority: Minor > > Column names in general are case insensitive in Pyspark, and df.drop() in > general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] > df1 = spark.createDataFrame(vals1, ['KEY','field']) > vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] > df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] == df2['key'], "left") > {code} > > drop will become case sensitive. e.g. 
> {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop('caps') # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke updated SPARK-28189: - Description: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. was: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Luke >Priority: Minor > > Column names in general are case insensitive in Pyspark, and df.drop() in > general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] > df1 = spark.createDataFrame(vals1, ['KEY','field']) > vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] > df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] == df2['key'], "left") > {code} > > drop will become case sensitive. e.g. 
> {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop(caps) # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke updated SPARK-28189: - Description: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. was: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] = df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Luke >Priority: Minor > > Column names in general are case insensitive in Pyspark, and df.drop() in > general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] > df1 = spark.createDataFrame(vals1, ['KEY','field']) > vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] > df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] == df2['key'], "left") > {code} > > drop will become case sensitive. e.g. 
> {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop(caps) # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke updated SPARK-28189: - Description: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(valuesA, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(valuesB, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] = df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. was: Column names in general are case insensitive in Pyspark, and df.drop("col") in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(valuesA, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(valuesB, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] = df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Luke >Priority: Minor > > Column names in general are case insensitive in Pyspark, and df.drop() in > general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] > df1 = spark.createDataFrame(valuesA, ['KEY','field']) > vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] > df2 = spark.createDataFrame(valuesB, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] = df2['key'], "left") > {code} > > drop will become case sensitive. e.g. 
> {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop(caps) # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke updated SPARK-28189: - Summary: Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables (was: Pyspark - df.drop is Case Sensitive when Referring to Upstream Tables) > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Luke >Priority: Minor > > Column names in general are case insensitive in Pyspark, and df.drop("col") > in general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] > df1 = spark.createDataFrame(valuesA, ['KEY','field']) > vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] > df2 = spark.createDataFrame(valuesB, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] = df2['key'], "left") > {code} > > drop will become case sensitive. e.g. > {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop(caps) # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke updated SPARK-28189: - Description: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] = df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. was: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(valuesA, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(valuesB, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] = df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Luke >Priority: Minor > > Column names in general are case insensitive in Pyspark, and df.drop() in > general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] > df1 = spark.createDataFrame(vals1, ['KEY','field']) > vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] > df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] = df2['key'], "left") > {code} > > drop will become case sensitive. e.g. 
> {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop(caps) # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28189) Pyspark - df.drop is Case Sensitive when Referring to Upstream Tables
Luke created SPARK-28189: Summary: Pyspark - df.drop is Case Sensitive when Referring to Upstream Tables Key: SPARK-28189 URL: https://issues.apache.org/jira/browse/SPARK-28189 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Luke Column names in general are case insensitive in Pyspark, and df.drop("col") in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(valuesA, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(valuesB, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] = df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop(caps) # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
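For readers following along, here is a self-contained PySpark script that restates the report above and the workarounds it implies (drop by name string, or pass the Column reference with its exact case). This only reproduces the reported behaviour; it is not a fix.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

vals1 = [('Pirate', 1), ('Monkey', 2), ('Ninja', 3), ('Spaghetti', 4)]
df1 = spark.createDataFrame(vals1, ['KEY', 'field'])
vals2 = [('Rutabaga', 1), ('Pirate', 2), ('Ninja', 3), ('Darth Vader', 4)]
df2 = spark.createDataFrame(vals2, ['KEY', 'CAPS'])
df_joined = df1.join(df2, df1['key'] == df2['key'], 'left')

# Dropping by name string is case insensitive, as expected.
df_joined.drop('caps').printSchema()       # CAPS column is gone

# Dropping by a Column reference is case sensitive: this is a silent no-op...
df_joined.drop(df2['caps']).printSchema()  # CAPS column is still present

# ...while the exact-case Column reference drops the column as expected.
df_joined.drop(df2['CAPS']).printSchema()  # CAPS column is gone
{code}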
[jira] [Resolved] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth
[ https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-28150. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24955 [https://github.com/apache/spark/pull/24955] > Failure to create multiple contexts in same JVM with Kerberos auth > -- > > Key: SPARK-28150 > URL: https://issues.apache.org/jira/browse/SPARK-28150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 3.0.0 > > > Take the following small app that creates multiple contexts (not > concurrently): > {code} > from pyspark.context import SparkContext > import time > for i in range(2): > with SparkContext() as sc: > pass > time.sleep(5) > {code} > This fails when kerberos (without dt renewal) is being used: > {noformat} > 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) > Caused by: > org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error > calling method hbase.pb.AuthenticationService.GetAuthenticationToken > at > org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) > Caused by: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): > org.apache.hadoop.hbase.security.AccessDeniedException: Token generation > only allowed for Kerberos authenticated clients > at > org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126) > {noformat} > If you enable dt renewal things work since the codes takes a slightly > different path when generating the initial delegation tokens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28150) Failure to create multiple contexts in same JVM with Kerberos auth
[ https://issues.apache.org/jira/browse/SPARK-28150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-28150: -- Assignee: Marcelo Vanzin > Failure to create multiple contexts in same JVM with Kerberos auth > -- > > Key: SPARK-28150 > URL: https://issues.apache.org/jira/browse/SPARK-28150 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > > Take the following small app that creates multiple contexts (not > concurrently): > {code} > from pyspark.context import SparkContext > import time > for i in range(2): > with SparkContext() as sc: > pass > time.sleep(5) > {code} > This fails when kerberos (without dt renewal) is being used: > {noformat} > 19/06/24 11:33:58 ERROR spark.SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:49) > Caused by: > org.apache.hadoop.hbase.shaded.com.google.protobuf.ServiceException: Error > calling method hbase.pb.AuthenticationService.GetAuthenticationToken > at > org.apache.hadoop.hbase.client.SyncCoprocessorRpcChannel.callBlockingMethod(SyncCoprocessorRpcChannel.java:71) > Caused by: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): > org.apache.hadoop.hbase.security.AccessDeniedException: Token generation > only allowed for Kerberos authenticated clients > at > org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:126) > {noformat} > If you enable dt renewal things work since the codes takes a slightly > different path when generating the initial delegation tokens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27871) LambdaVariable should use per-query unique IDs instead of globally unique IDs
[ https://issues.apache.org/jira/browse/SPARK-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27871. - Resolution: Fixed Fix Version/s: 3.0.0 > LambdaVariable should use per-query unique IDs instead of globally unique IDs > - > > Key: SPARK-27871 > URL: https://issues.apache.org/jira/browse/SPARK-27871 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28188) Materialize Dataframe API
Vinitha Reddy Gankidi created SPARK-28188: - Summary: Materialize Dataframe API Key: SPARK-28188 URL: https://issues.apache.org/jira/browse/SPARK-28188 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.3 Reporter: Vinitha Reddy Gankidi We have added a new API to materialize dataframes and our internal users have found it very useful. For use cases where you need to do different computations on the same dataframe, Spark recomputes the dataframe each time. This is problematic if evaluation of the dataframe is expensive. Materialize is a Spark action. It is a way to let Spark explicitly know that the dataframe has already been computed. Once a dataframe is materialized, Spark skips all stages prior to the materialize when the dataframe is reused later on. Spark may scan the same table twice if two queries load different columns. For example, the following two queries would scan the same data twice: {code:scala} val tab = spark.table("some_table").filter("c LIKE '%match%'") val num_groups = tab.agg(countDistinct($"a")) val groups_with_b = tab.groupBy($"a").agg(min($"b") as "min"){code} The same table is scanned twice because Spark doesn't know it should load b when the first query runs. You can use materialize to load and then reuse the data: {code:scala} val materialized = spark.table("some_table").filter("c LIKE '%match%'") .select($"a", $"b").repartition($"a").materialize() val num_groups = materialized.agg(countDistinct($"a")) val groups_with_b = materialized.groupBy($"a").agg(min($"b") as "min"){code} This uses select to filter out columns that don't need to be loaded. Without this, Spark doesn't know that only a and b are going to be used later. This example also uses repartition to add a shuffle because Spark resumes from the last shuffle. In most cases you may need to repartition the dataframe before materializing it in order to skip the expensive stages, as repartition introduces a new stage. h3. Materialize vs Cache: * Caching/Persisting of dataframes is lazy. The first time the dataset is computed in an action, it will be kept in memory on the nodes. Materialize is an action that runs a job that produces the rows of data that a data frame represents, and returns a new data frame with the result. When the result data frame is used, Spark resumes execution using the data from the last shuffle. * By reusing shuffle data, materialized data is served by the cluster's persistent shuffle servers instead of Spark executors. This makes materialize more reliable. Caching on the other hand happens in the executor where the task runs and data could be lost if executors time out from inactivity or run out of memory. * Since materialize is more reliable and uses fewer resources than cache, it is usually a better choice for batch workloads. But, for processing that iterates over a dataset many times, it is better to keep the data in memory using cache or persist. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
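Since the proposed materialize() is not part of upstream Spark, a rough PySpark sketch of the cache-based alternative discussed in the "Materialize vs Cache" comparison above might look like the following. The table name is a placeholder, and persist() is lazy, so an action is needed to populate the cache before the dataframe is reused.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table; only columns a and b are needed downstream.
tab = (spark.table("some_table")
       .filter("c LIKE '%match%'")
       .select("a", "b")
       .persist())

tab.count()  # action to populate the cache (persist alone is lazy)

num_groups = tab.agg(F.countDistinct("a"))
groups_with_b = tab.groupBy("a").agg(F.min("b").alias("min"))

num_groups.show()
groups_with_b.show()

tab.unpersist()
{code}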
[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-28157: Fix Version/s: 2.4.4 2.3.4 > Make SHS clear KVStore LogInfo for the blacklisted entries > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > As of Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system and maintains a blacklist of all event log files that failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S. > However, files whose size doesn't change are ignored forever, because > shouldReloadLog always returns false when the size is the same as the value > in the KVStore. This can only be recovered by an SHS restart. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28187) Add hadoop-cloud module to PR builders
[ https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28187: Assignee: Apache Spark > Add hadoop-cloud module to PR builders > -- > > Key: SPARK-28187 > URL: https://issues.apache.org/jira/browse/SPARK-28187 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > We currently don't build / test the hadoop-cloud stuff in PRs. See > https://github.com/apache/spark/pull/24970 for an example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28187) Add hadoop-cloud module to PR builders
[ https://issues.apache.org/jira/browse/SPARK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28187: Assignee: (was: Apache Spark) > Add hadoop-cloud module to PR builders > -- > > Key: SPARK-28187 > URL: https://issues.apache.org/jira/browse/SPARK-28187 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We currently don't build / test the hadoop-cloud stuff in PRs. See > https://github.com/apache/spark/pull/24970 for an example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28187) Add hadoop-cloud module to PR builders
Marcelo Vanzin created SPARK-28187: -- Summary: Add hadoop-cloud module to PR builders Key: SPARK-28187 URL: https://issues.apache.org/jira/browse/SPARK-28187 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Marcelo Vanzin We currently don't build / test the hadoop-cloud stuff in PRs. See https://github.com/apache/spark/pull/24970 for an example. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27466) LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' causes exception in Spark
[ https://issues.apache.org/jira/browse/SPARK-27466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874292#comment-16874292 ] Bruce Robbins commented on SPARK-27466: --- Hi [~hvanhovell] and/or [~yhuai], any comment on my previous comment? > LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' > causes exception in Spark > --- > > Key: SPARK-27466 > URL: https://issues.apache.org/jira/browse/SPARK-27466 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 > Environment: Spark version 2.2.0.2.6.4.92-2 > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) >Reporter: Zoltan >Priority: Major > > *1. Create a table in Hive:* > > {code:java} > CREATE TABLE tab1( > col1 varchar(1), > col2 varchar(1) > ) > PARTITIONED BY ( > col3 varchar(1) > ) > LOCATION > 'hdfs://server1/data/tab1' > {code} > > *2. Query the Table in Spark:* > *2.1: Simple query, no exception thrown:* > {code:java} > scala> spark.sql("SELECT * from schema1.tab1").show() > +-+---++ > |col1|col2|col3| > +-+---++ > +-+---++ > {code} > *2.2.: Query causing exception:* > {code:java} > scala> spark.sql("*SELECT (LEAD(col1) OVER ( PARTITION BY col3 ORDER BY col1 > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING*)) from > schema1.tab1") > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN UNBOUNDED > PRECEDING AND UNBOUNDED FOLLOWING must match the required frame ROWS BETWEEN > 1 FOLLOWING AND 1 FOLLOWING; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2219) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2215) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249) > at >
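As the error message above indicates, LEAD/LAG only accept the specific frame they require, so the usual way to avoid this analysis error is to omit the explicit frame and let Spark supply the one the function needs. A minimal PySpark illustration against the table from the report (assumes a SparkSession named spark, as in pyspark or spark-shell):
{code:python}
# Explicit UNBOUNDED frame with LEAD raises the AnalysisException shown above:
# spark.sql("""SELECT LEAD(col1) OVER (PARTITION BY col3 ORDER BY col1
#              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
#              FROM schema1.tab1""")

# Omitting the frame lets Spark use the frame LEAD requires
# (ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), so this succeeds.
spark.sql("""
  SELECT LEAD(col1) OVER (PARTITION BY col3 ORDER BY col1)
  FROM schema1.tab1
""").show()
{code}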
[jira] [Resolved] (SPARK-28174) Upgrade to Kafka 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28174. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24976 > Upgrade to Kafka 2.3.0 > -- > > Key: SPARK-28174 > URL: https://issues.apache.org/jira/browse/SPARK-28174 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue updates Kafka dependency to 2.3.0 to bring the following 9 > client-side patches at least. > - > https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20fixVersion%20NOT%20IN%20(2.2.0%2C%202.2.1)%20AND%20component%20%3D%20clients > The following is a full release note. > - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null
Alex Kushnir created SPARK-28186: Summary: array_contains returns null instead of false when one of the items in the array is null Key: SPARK-28186 URL: https://issues.apache.org/jira/browse/SPARK-28186 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Alex Kushnir If an array contains a null item, array_contains returns true if the item is found, but if the item is not found it returns null instead of false: Seq( (1, Seq("a", "b", "c")), (2, Seq("a", "b", null, "c")) ).toDF("id", "vals").createOrReplaceTempView("tbl") spark.sql("select id, vals, array_contains(vals, 'a') as has_a, array_contains(vals, 'd') as has_d from tbl").show
+---+----------+-----+-----+
| id|      vals|has_a|has_d|
+---+----------+-----+-----+
|  1| [a, b, c]| true|false|
|  2|[a, b,, c]| true| null|
+---+----------+-----+-----+
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null
[ https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Kushnir updated SPARK-28186: - Description: If array of items contains a null item then array_contains returns true if item is found but if item is not found it returns null instead of false Seq( (1, Seq("a", "b", "c")), (2, Seq("a", "b", null, "c")) ).toDF("id", "vals").createOrReplaceTempView("tbl") spark.sql("select id, vals, array_contains(vals, 'a') as has_a, array_contains(vals, 'd') as has_d from tbl").show
+---+----------+-----+-----+
| id|      vals|has_a|has_d|
+---+----------+-----+-----+
|  1| [a, b, c]| true|false|
|  2|[a, b,, c]| true| null|
+---+----------+-----+-----+
was: If array of items contains a null item when array_contains returns true if item is found but if item is not found it returns null instead of false Seq( (1, Seq("a", "b", "c")), (2, Seq("a", "b", null, "c")) ).toDF("id", "vals").createOrReplaceTempView("tbl") spark.sql("select id, vals, array_contains(vals, 'a') as has_a, array_contains(vals, 'd') as has_d from tbl").show
+---+----------+-----+-----+
| id|      vals|has_a|has_d|
+---+----------+-----+-----+
|  1| [a, b, c]| true|false|
|  2|[a, b,, c]| true| null|
+---+----------+-----+-----+
> array_contains returns null instead of false when one of the items in the > array is null > --- > > Key: SPARK-28186 > URL: https://issues.apache.org/jira/browse/SPARK-28186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alex Kushnir >Priority: Major > > If array of items contains a null item then array_contains returns true if > item is found but if item is not found it returns null instead of false > Seq( > (1, Seq("a", "b", "c")), > (2, Seq("a", "b", null, "c")) > ).toDF("id", "vals").createOrReplaceTempView("tbl") > spark.sql("select id, vals, array_contains(vals, 'a') as has_a, > array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> | id|      vals|has_a|has_d|
> +---+----------+-----+-----+
> |  1| [a, b, c]| true|false|
> |  2|[a, b,, c]| true| null|
> +---+----------+-----+-----+
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
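The null result is consistent with SQL three-valued logic: once the array contains a null element, a non-match is "unknown" rather than false. If false is the desired answer, one common workaround is to coalesce the result, sketched below in PySpark with the column and value names taken from the report.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, ["a", "b", "c"]), (2, ["a", "b", None, "c"])],
    ["id", "vals"])

df.select(
    "id", "vals",
    # null for id=2, as reported above
    F.array_contains("vals", "d").alias("has_d"),
    # coalesce turns the "unknown" result into false
    F.coalesce(F.array_contains("vals", "d"), F.lit(False)).alias("has_d_strict")
).show(truncate=False)
{code}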
[jira] [Assigned] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early
[ https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28185: Assignee: Apache Spark > Trigger pandas iterator UDF closing stuff when iterator stop early > -- > > Key: SPARK-28185 > URL: https://issues.apache.org/jira/browse/SPARK-28185 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > Fix the issue Pandas UDF closing stuff won't be triggered when iterator stop > early. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early
[ https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28185: Assignee: (was: Apache Spark) > Trigger pandas iterator UDF closing stuff when iterator stop early > -- > > Key: SPARK-28185 > URL: https://issues.apache.org/jira/browse/SPARK-28185 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Priority: Major > > Fix the issue Pandas UDF closing stuff won't be triggered when iterator stop > early. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early
Weichen Xu created SPARK-28185: -- Summary: Trigger pandas iterator UDF closing stuff when iterator stop early Key: SPARK-28185 URL: https://issues.apache.org/jira/browse/SPARK-28185 Project: Spark Issue Type: Bug Components: ML, SQL Affects Versions: 2.4.3 Reporter: Weichen Xu Fix the issue that a pandas iterator UDF's cleanup ("closing") code is not triggered when the iterator stops early. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
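For context, a sketch of the pattern this fix targets, using the iterator (SCALAR_ITER) pandas UDF available in Spark 3.0. The resource here is only a stand-in for expensive setup; the point is that the finally block is the "closing stuff" that should still run when Spark stops consuming the iterator early (for example under a limit).
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("long", PandasUDFType.SCALAR_ITER)
def plus_one(batches):
    # Stand-in for expensive setup (e.g. loading a model, opening a connection).
    resource = {"open": True}
    try:
        for batch in batches:      # each batch is a pandas.Series
            yield batch + 1
    finally:
        # Cleanup that this issue ensures is triggered even if Spark
        # stops pulling batches before the iterator is exhausted.
        resource["open"] = False

# Early-stop scenario that exercises the cleanup path:
# df.select(plus_one("value")).limit(1).collect()
{code}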
[jira] [Assigned] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite
[ https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28184: Assignee: Apache Spark > Avoid creating new sessions in SparkMetadataOperationSuite > -- > > Key: SPARK-28184 > URL: https://issues.apache.org/jira/browse/SPARK-28184 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite
[ https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28184: Assignee: (was: Apache Spark) > Avoid creating new sessions in SparkMetadataOperationSuite > -- > > Key: SPARK-28184 > URL: https://issues.apache.org/jira/browse/SPARK-28184 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28183) Add a task status filter for taskList in REST API
[ https://issues.apache.org/jira/browse/SPARK-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28183: Assignee: Apache Spark > Add a task status filter for taskList in REST API > - > > Key: SPARK-28183 > URL: https://issues.apache.org/jira/browse/SPARK-28183 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > > We have a scenario that our application needs to query failed tasks by REST > API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} > when Spark job is running. In a large Stage, it may filter out dozens of > failed tasks from hundred thousands total tasks. It consumes much unnecessary > memory and time both in Spark and App side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28183) Add a task status filter for taskList in REST API
[ https://issues.apache.org/jira/browse/SPARK-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28183: Assignee: (was: Apache Spark) > Add a task status filter for taskList in REST API > - > > Key: SPARK-28183 > URL: https://issues.apache.org/jira/browse/SPARK-28183 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > We have a scenario that our application needs to query failed tasks by REST > API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} > when Spark job is running. In a large Stage, it may filter out dozens of > failed tasks from hundred thousands total tasks. It consumes much unnecessary > memory and time both in Spark and App side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite
[ https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28184: Summary: Avoid creating new sessions in SparkMetadataOperationSuite (was: Avoid create new session in SparkMetadataOperationSuite) > Avoid creating new sessions in SparkMetadataOperationSuite > -- > > Key: SPARK-28184 > URL: https://issues.apache.org/jira/browse/SPARK-28184 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28184) Avoid create new session in SparkMetadataOperationSuite
Yuming Wang created SPARK-28184: --- Summary: Avoid create new session in SparkMetadataOperationSuite Key: SPARK-28184 URL: https://issues.apache.org/jira/browse/SPARK-28184 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28183) Add a task status filter for taskList in REST API
[ https://issues.apache.org/jira/browse/SPARK-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-28183: --- Description: We have a scenario where our application needs to query failed tasks via the REST API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} while a Spark job is running. In a large Stage, it may need to pick out only dozens of failed tasks from hundreds of thousands of tasks in total. This consumes a lot of unnecessary memory and time on both the Spark and application sides. was: We have a scenario where our application needs to query failed tasks via the REST API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} while a Spark job is running. In a large Stage, it may contain hundreds of thousands of tasks in total. Although the API offers a paginated query via {{?offset=[offset]&length=[len]}}, it still has two disadvantages: 1. The app still needs to query all tasks, which consumes a lot of unnecessary memory and time on both the Spark and application sides. 2. Paginating via {{?offset=[offset]&length=[len]}} makes the logic much more complex, and all tasks still have to be handled. > Add a task status filter for taskList in REST API > - > > Key: SPARK-28183 > URL: https://issues.apache.org/jira/browse/SPARK-28183 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > We have a scenario where our application needs to query failed tasks via the REST > API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} > while a Spark job is running. In a large Stage, it may need to pick out only dozens of > failed tasks from hundreds of thousands of tasks in total. This consumes a lot of unnecessary > memory and time on both the Spark and application sides. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28183) Add a task status filter for taskList in REST API
Lantao Jin created SPARK-28183: -- Summary: Add a task status filter for taskList in REST API Key: SPARK-28183 URL: https://issues.apache.org/jira/browse/SPARK-28183 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 3.0.0 Reporter: Lantao Jin We have a scenario where our application needs to query failed tasks via the REST API {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList}} while a Spark job is running. In a large Stage, it may contain hundreds of thousands of tasks in total. Although the API offers a paginated query via {{?offset=[offset]&length=[len]}}, it still has two disadvantages: 1. The app still needs to query all tasks, which consumes a lot of unnecessary memory and time on both the Spark and application sides. 2. Paginating via {{?offset=[offset]&length=[len]}} makes the logic much more complex, and all tasks still have to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
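A quick sketch of how the proposed filter might be used from an application, assuming the new query parameter is named {{status}}; the parameter name, application id, and stage ids below are placeholders for illustration, not something fixed by this ticket:

{code:scala}
import scala.io.Source

object FailedTaskQuery {
  def main(args: Array[String]): Unit = {
    // REST endpoint of a locally running application's UI (placeholder values).
    val base = "http://localhost:4040/api/v1"
    val url  = s"$base/applications/app-20190627000000-0001/stages/3/0/taskList?status=FAILED"
    // With a server-side status filter, only the failed tasks are serialised and transferred,
    // instead of the client paging through the whole task list.
    val failedTasksJson = Source.fromURL(url).mkString
    println(failedTasksJson)
  }
}
{code}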
[jira] [Assigned] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27714: Assignee: (was: Apache Spark) > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > Currently the join reorder logic is based on dynamic programming, which can in theory find the > optimal plan, but the search cost grows rapidly as > the # of joined tables grows. It would be better to introduce a genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27714: Assignee: Apache Spark > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Apache Spark >Priority: Major > > Currently the join reorder logic is based on dynamic programming, which can in theory find the > optimal plan, but the search cost grows rapidly as > the # of joined tables grows. It would be better to introduce a genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
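For context, a toy sketch of how a GA searches the join-order space is below. Everything in it is illustrative: the cost function is a crude stand-in for Spark's statistics-based cost model, and none of the names correspond to actual optimizer classes.

{code:scala}
import scala.util.Random

object JoinOrderGA {
  type Order = Vector[Int] // a chromosome: a permutation of table indices

  // Toy cost: sum of cumulative intermediate result sizes. A real optimizer would
  // use table/column statistics and join selectivities instead.
  def cost(order: Order, sizes: Vector[Long]): Double =
    order.scanLeft(1.0)((acc, t) => acc * sizes(t)).drop(1).sum

  // Order crossover: keep a prefix of one parent, fill the rest in the other parent's order,
  // so each table still appears exactly once.
  def crossover(a: Order, b: Order, rnd: Random): Order = {
    val head = a.take(rnd.nextInt(a.length))
    head ++ b.filterNot(head.contains)
  }

  // Mutation: swap two random positions.
  def mutate(o: Order, rnd: Random): Order = {
    val i = rnd.nextInt(o.length)
    val j = rnd.nextInt(o.length)
    o.updated(i, o(j)).updated(j, o(i))
  }

  def search(sizes: Vector[Long], popSize: Int = 50, generations: Int = 200): Order = {
    val rnd = new Random(42)
    var population = Vector.fill(popSize)(rnd.shuffle(sizes.indices.toVector))
    for (_ <- 0 until generations) {
      // Keep the cheaper half as parents, refill the population with their offspring.
      val parents = population.sortBy(cost(_, sizes)).take(popSize / 2)
      population = parents ++ Vector.fill(popSize - parents.size) {
        mutate(crossover(parents(rnd.nextInt(parents.size)), parents(rnd.nextInt(parents.size)), rnd), rnd)
      }
    }
    population.minBy(cost(_, sizes))
  }

  def main(args: Array[String]): Unit = {
    // 14 tables with assumed row counts; exhaustive search over all orders is impractical here.
    val sizes = Vector.tabulate(14)(i => (i + 1) * 1000L)
    println(search(sizes))
  }
}
{code}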
[jira] [Commented] (SPARK-28182) Spark fails to download Hive 2.2+ jars from maven
[ https://issues.apache.org/jira/browse/SPARK-28182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874029#comment-16874029 ] Emlyn Corrin commented on SPARK-28182: -- I can work around it by adding --packages org.apache.zookeeper:zookeeper:3.4.6 to the command line, or by setting an earlier Hive metastore version (it seems to work even with the default 1.2.1 jars when connecting to Hive metastore 2.3.0). > Spark fails to download Hive 2.2+ jars from maven > - > > Key: SPARK-28182 > URL: https://issues.apache.org/jira/browse/SPARK-28182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Emlyn Corrin >Priority: Major > > {{When starting Spark with spark.sql.hive.metastore.jars=maven and a > spark.sql.hive.metastore.version of 2.2 or 2.3 it fails to download the > required jars. It looks like it just downloads the -tests version of the > zookeeper 3.4.6 jar:}} > {noformat} > > rm -rf ~/.ivy2 > > pyspark --conf > > spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 --conf > > spark.sql.catalogImplementation=hive --conf > > spark.sql.hive.metastore.jars=maven --conf > > spark.sql.hive.metastore.version=2.3 > Python 3.7.3 (default, Mar 27 2019, 09:23:39) > Type 'copyright', 'credits' or 'license' for more information > IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help. > 19/06/27 12:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.4.3 > /_/ > Using Python version 3.7.3 (default, Mar 27 2019 09:23:39) > SparkSession available as 'spark'. 
> In [1]: spark.sql('show databases').show(){noformat} > > {noformat} > http://www.datanucleus.org/downloads/maven2 added as a remote repository with > the name: repo-1 > Ivy Default Cache set to: /Users/emcorrin/.ivy2/cache > The jars for the packages stored in: /Users/emcorrin/.ivy2/jars > :: loading settings :: url = > jar:file:/usr/local/Cellar/apache-spark/2.4.3/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > org.apache.hive#hive-metastore added as a dependency > org.apache.hive#hive-exec added as a dependency > org.apache.hive#hive-common added as a dependency > org.apache.hive#hive-serde added as a dependency > com.google.guava#guava added as a dependency > org.apache.hadoop#hadoop-client added as a dependency > :: resolving dependencies :: > org.apache.spark#spark-submit-parent-639456d4-9614-45c9-ad3c-567e4fa69f79;1.0 > confs: [default] > found org.apache.hive#hive-metastore;2.3.3 in central > found org.apache.hive#hive-serde;2.3.3 in central > found org.apache.hive#hive-common;2.3.3 in central > found org.apache.hive#hive-shims;2.3.3 in central > found org.apache.hive.shims#hive-shims-common;2.3.3 in central > found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central > found org.slf4j#slf4j-api;1.7.10 in central > found com.google.guava#guava;14.0.1 in central > found commons-lang#commons-lang;2.6 in central > found org.apache.thrift#libthrift;0.9.3 in central > found org.apache.httpcomponents#httpclient;4.4 in central > found org.apache.httpcomponents#httpcore;4.4 in central > found commons-logging#commons-logging;1.2 in central > found commons-codec#commons-codec;1.4 in central > found org.apache.zookeeper#zookeeper;3.4.6 in central > found org.slf4j#slf4j-log4j12;1.6.1 in central > found log4j#log4j;1.2.16 in central > found jline#jline;2.12 in central > found io.netty#netty;3.7.0.Final in central > found org.apache.hive.shims#hive-shims-0.23;2.3.3 in central > found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in central > found org.apache.hadoop#hadoop-annotations;2.7.2 in central > found com.google.inject.extensions#guice-servlet;3.0 in central > found com.google.inject#guice;3.0 in central > found javax.inject#javax.inject;1 in central > found aopalliance#aopalliance;1.0 in central > found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central > found asm#asm;3.2 in central > found com.google.protobuf#protobuf-java;2.5.0 in central > found commons-io#commons-io;2.4 in central > found com.sun.jersey#jersey-json;1.14 in central > found org.codehaus.jettison#jettison;1.1 in central > found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central > found javax.xml.bind#jaxb-api;2.2.2 in central > found javax.xml.stream#stax-api;1.0-2 in central > found javax.activation#activation;1.1 in central
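For reference, the workaround mentioned in the comment would look roughly like this on the command line; it is the same invocation as in the report, with the extra {{--packages}} coordinate, and is not independently verified here:

{noformat}
pyspark --conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.hive.metastore.jars=maven \
  --conf spark.sql.hive.metastore.version=2.3 \
  --packages org.apache.zookeeper:zookeeper:3.4.6
{noformat}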
[jira] [Created] (SPARK-28182) Spark fails to download Hive 2.2+ jars from maven
Emlyn Corrin created SPARK-28182: Summary: Spark fails to download Hive 2.2+ jars from maven Key: SPARK-28182 URL: https://issues.apache.org/jira/browse/SPARK-28182 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Emlyn Corrin {{When starting Spark with spark.sql.hive.metastore.jars=maven and a spark.sql.hive.metastore.version of 2.2 or 2.3 it fails to download the required jars. It looks like it just downloads the -tests version of the zookeeper 3.4.6 jar:}} {noformat} > rm -rf ~/.ivy2 > pyspark --conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 > --conf spark.sql.catalogImplementation=hive --conf > spark.sql.hive.metastore.jars=maven --conf > spark.sql.hive.metastore.version=2.3 Python 3.7.3 (default, Mar 27 2019, 09:23:39) Type 'copyright', 'credits' or 'license' for more information IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help. 19/06/27 12:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.4.3 /_/ Using Python version 3.7.3 (default, Mar 27 2019 09:23:39) SparkSession available as 'spark'. In [1]: spark.sql('show databases').show(){noformat} {noformat} http://www.datanucleus.org/downloads/maven2 added as a remote repository with the name: repo-1 Ivy Default Cache set to: /Users/emcorrin/.ivy2/cache The jars for the packages stored in: /Users/emcorrin/.ivy2/jars :: loading settings :: url = jar:file:/usr/local/Cellar/apache-spark/2.4.3/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml org.apache.hive#hive-metastore added as a dependency org.apache.hive#hive-exec added as a dependency org.apache.hive#hive-common added as a dependency org.apache.hive#hive-serde added as a dependency com.google.guava#guava added as a dependency org.apache.hadoop#hadoop-client added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-639456d4-9614-45c9-ad3c-567e4fa69f79;1.0 confs: [default] found org.apache.hive#hive-metastore;2.3.3 in central found org.apache.hive#hive-serde;2.3.3 in central found org.apache.hive#hive-common;2.3.3 in central found org.apache.hive#hive-shims;2.3.3 in central found org.apache.hive.shims#hive-shims-common;2.3.3 in central found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central found org.slf4j#slf4j-api;1.7.10 in central found com.google.guava#guava;14.0.1 in central found commons-lang#commons-lang;2.6 in central found org.apache.thrift#libthrift;0.9.3 in central found org.apache.httpcomponents#httpclient;4.4 in central found org.apache.httpcomponents#httpcore;4.4 in central found commons-logging#commons-logging;1.2 in central found commons-codec#commons-codec;1.4 in central found org.apache.zookeeper#zookeeper;3.4.6 in central found org.slf4j#slf4j-log4j12;1.6.1 in central found log4j#log4j;1.2.16 in central found jline#jline;2.12 in central found io.netty#netty;3.7.0.Final in central found org.apache.hive.shims#hive-shims-0.23;2.3.3 in central found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in central found org.apache.hadoop#hadoop-annotations;2.7.2 in central found com.google.inject.extensions#guice-servlet;3.0 in central found 
com.google.inject#guice;3.0 in central found javax.inject#javax.inject;1 in central found aopalliance#aopalliance;1.0 in central found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central found asm#asm;3.2 in central found com.google.protobuf#protobuf-java;2.5.0 in central found commons-io#commons-io;2.4 in central found com.sun.jersey#jersey-json;1.14 in central found org.codehaus.jettison#jettison;1.1 in central found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central found javax.xml.bind#jaxb-api;2.2.2 in central found javax.xml.stream#stax-api;1.0-2 in central found javax.activation#activation;1.1 in central found org.codehaus.jackson#jackson-core-asl;1.9.13 in central found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central found org.codehaus.jackson#jackson-jaxrs;1.9.13 in central found org.codehaus.jackson#jackson-xc;1.9.13 in central found com.sun.jersey.contribs#jersey-guice;1.9 in central found org.apache.hadoop#hadoop-yarn-common;2.7.2 in central found org.apache.hadoop#hadoop-yarn-api;2.7.2 in central found org.apache.commons#commons-compress;1.9 in central found org.mortbay.jetty#jetty-util;6.1.26 in central found com.sun.jersey#jersey-core;1.14 in central found com.sun.jersey#jersey-client;1.9 in central found
[jira] [Updated] (SPARK-23179) Support option to throw exception if overflow occurs during Decimal arithmetic
[ https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23179: Summary: Support option to throw exception if overflow occurs during Decimal arithmetic (was: Support option to throw exception if overflow occurs) > Support option to throw exception if overflow occurs during Decimal arithmetic > -- > > Key: SPARK-23179 > URL: https://issues.apache.org/jira/browse/SPARK-23179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > ANSI SQL 2011 states that in case of overflow during arithmetic operations, > an exception should be thrown. This is what most SQL databases do (e.g. > SQL Server, DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 > is open to make it SQL compliant. > I propose to add a config option which allows deciding whether Spark should > behave according to the SQL standard or in the current way (i.e. returning NULL). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23179) Support option to throw exception if overflow occurs
[ https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23179. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20350 [https://github.com/apache/spark/pull/20350] > Support option to throw exception if overflow occurs > > > Key: SPARK-23179 > URL: https://issues.apache.org/jira/browse/SPARK-23179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > ANSI SQL 2011 states that in case of overflow during arithmetic operations, > an exception should be thrown. This is what most SQL databases do (e.g. > SQL Server, DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 > is open to make it SQL compliant. > I propose to add a config option which allows deciding whether Spark should > behave according to the SQL standard or in the current way (i.e. returning NULL). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23179) Support option to throw exception if overflow occurs
[ https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23179: --- Assignee: Marco Gaido > Support option to throw exception if overflow occurs > > > Key: SPARK-23179 > URL: https://issues.apache.org/jira/browse/SPARK-23179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > > ANSI SQL 2011 states that in case of overflow during arithmetic operations, > an exception should be thrown. This is what most SQL databases do (e.g. > SQL Server, DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 > is open to make it SQL compliant. > I propose to add a config option which allows deciding whether Spark should > behave according to the SQL standard or in the current way (i.e. returning NULL). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
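A small illustration of the behaviour difference the option is about, runnable in spark-shell. The config key below is assumed for illustration and is not taken from the pull request:

{code:scala}
// The multiplication below needs 39 digits, which does not fit the
// capped DECIMAL(38,0) result type, so it overflows.
val overflowing =
  "SELECT CAST(12345678901234567890 AS DECIMAL(38,0)) * CAST(12345678901234567890 AS DECIMAL(38,0))"

spark.sql(overflowing).show()  // current behaviour: the result column is NULL

// Hypothetical config key for the proposed option; with the strict behaviour enabled,
// the same query would throw an arithmetic exception instead of returning NULL.
spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "false")
spark.sql(overflowing).show()
{code}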
[jira] [Assigned] (SPARK-28181) Add a filter interface to KVStore to speed up the entities retrieve
[ https://issues.apache.org/jira/browse/SPARK-28181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28181: Assignee: Apache Spark > Add a filter interface to KVStore to speed up the entities retrieve > --- > > Key: SPARK-28181 > URL: https://issues.apache.org/jira/browse/SPARK-28181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > > Currently, entities in KVStore can only be retrieved all at once or not at all. This ticket > adds a filter interface to KVStore to speed up entity retrieval. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28181) Add a filter interface to KVStore to speed up the entities retrieve
[ https://issues.apache.org/jira/browse/SPARK-28181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28181: Assignee: (was: Apache Spark) > Add a filter interface to KVStore to speed up the entities retrieve > --- > > Key: SPARK-28181 > URL: https://issues.apache.org/jira/browse/SPARK-28181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Currently, entities in KVStore can only be retrieved all at once or not at all. This ticket > adds a filter interface to KVStore to speed up entity retrieval. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28181) Add a filter interface to KVStore to speed up the entities retrieve
Lantao Jin created SPARK-28181: -- Summary: Add a filter interface to KVStore to speed up the entities retrieve Key: SPARK-28181 URL: https://issues.apache.org/jira/browse/SPARK-28181 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Lantao Jin Currently, entities in KVStore can only be retrieved all at once or not at all. This ticket adds a filter interface to KVStore to speed up entity retrieval. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
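A rough sketch of the kind of filter interface the ticket asks for, written only to illustrate the idea: the trait, the helper, and the TaskData shape are assumptions and not the interface that was actually added to KVStore.

{code:scala}
object KVStoreFilterSketch {
  // A predicate-style filter applied while iterating over stored entities,
  // so callers no longer have to materialise everything and discard most of it.
  trait KVStoreFilter[T] {
    def accept(entity: T): Boolean
  }

  def filtered[T](entities: Iterator[T], filter: KVStoreFilter[T]): Iterator[T] =
    entities.filter(filter.accept)

  // Illustrative entity and usage: keep only failed tasks.
  final case class TaskData(taskId: Long, status: String)

  def main(args: Array[String]): Unit = {
    val all = Iterator(TaskData(1L, "SUCCESS"), TaskData(2L, "FAILED"), TaskData(3L, "RUNNING"))
    val onlyFailed = new KVStoreFilter[TaskData] {
      override def accept(t: TaskData): Boolean = t.status == "FAILED"
    }
    filtered(all, onlyFailed).foreach(println)
  }
}
{code}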
[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873905#comment-16873905 ] Stavros Kontopoulos commented on SPARK-24417: - How about transitive deps? Don't they need to support Java 11 as well? I spotted this one: [https://github.com/twitter/scrooge/pull/300], which is used by Twitter chill. > Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK11. > As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per > community discussion, we will skip JDK9 and 10 and support JDK 11 directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInfo
[ https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28178: Assignee: (was: Apache Spark) > DataSourceV2: DataFrameWriter.insertInfo > > > Key: SPARK-28178 > URL: https://issues.apache.org/jira/browse/SPARK-28178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInfo
[ https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28178: Assignee: Apache Spark > DataSourceV2: DataFrameWriter.insertInfo > > > Key: SPARK-28178 > URL: https://issues.apache.org/jira/browse/SPARK-28178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: Apache Spark >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
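A minimal sketch of the use case listed above, with a made-up catalog name and plugin class purely for illustration; how the catalog gets registered is an assumption, not part of this ticket:

{code:scala}
import org.apache.spark.sql.SparkSession

object InsertIntoCatalogExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insertInto-multi-catalog")
      // "testcat" and the plugin class are placeholders for a configured v2 catalog.
      .config("spark.sql.catalog.testcat", "org.example.MyCatalogPlugin")
      .getOrCreate()

    val df = spark.range(10).toDF("id")
    // The use case from the ticket: the three-part name should resolve through "testcat"
    // rather than the default session catalog.
    df.write.insertInto("testcat.db.tbl")
  }
}
{code}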
[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fail on asserts when attemtping it from a Dataset
[ https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-28180: Description: I am converting an _RDD_ Spark program to a _Dataset_ one. Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an RDD of Entreprise objects with Encoders.bean(Entreprise.class); now it does the conversion more simply, by loading the CSV content into a Dataset and applying _Encoders.bean(Entreprise.class)_ to it. {code:java} Dataset<Row> csv = this.session.read().format("csv") .option("header","true").option("quote", "\"").option("escape", "\"") .load(source.getAbsolutePath()) .selectExpr( "ActivitePrincipaleUniteLegale as ActivitePrincipale", "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise", "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie", "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur", "CategorieEntreprise", "CategorieJuridiqueUniteLegale as CategorieJuridique", "DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as DateDebutHistorisation", "DateDernierTraitementUniteLegale as DateDernierTraitement", "DenominationUniteLegale as Denomination", "DenominationUsuelle1UniteLegale as DenominationUsuelle1", "DenominationUsuelle2UniteLegale as DenominationUsuelle2", "DenominationUsuelle3UniteLegale as DenominationUsuelle3", "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire", "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active", "IdentifiantAssociationUniteLegale as IdentifiantAssociation", "NicSiegeUniteLegale as NicSiege", "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes", "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale", "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage", "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", "PrenomUsuelUniteLegale as PrenomUsuel", "PseudonymeUniteLegale as Pseudonyme", "SexeUniteLegale as Sexe", "SigleUniteLegale as Sigle", "Siren", "TrancheEffectifsUniteLegale as TrancheEffectifSalarie" ); {code} The _Dataset_ is successfully created. 
But the following call of _Encoders.bean(Enterprise.class)_ fails : {code:java} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178) at test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532) at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59) at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72) at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98) at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74) at java.util.ArrayList.forEach(ArrayList.java:1257) at
[jira] [Created] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fail on asserts when attemtping it from a Dataset
M. Le Bihan created SPARK-28180: --- Summary: Encoding CSV to Pojo works with Encoders.bean on RDD but fail on asserts when attemtping it from a Dataset Key: SPARK-28180 URL: https://issues.apache.org/jira/browse/SPARK-28180 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: Debian 9, Java 8. Reporter: M. Le Bihan I am converting an _RDD_ Spark program to a _Dataset_ one. Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an RDD of Entreprise objects with Encoders.bean(Entreprise.class); now it does the conversion more simply, by loading the CSV content into a Dataset and applying _Encoders.bean(Entreprise.class)_ to it. {code:java} Dataset<Row> csv = this.session.read().format("csv") .option("header","true").option("quote", "\"").option("escape", "\"") .load(source.getAbsolutePath()) .selectExpr( "ActivitePrincipaleUniteLegale as ActivitePrincipale", "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise", "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie", "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur", "CategorieEntreprise", "CategorieJuridiqueUniteLegale as CategorieJuridique", "DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as DateDebutHistorisation", "DateDernierTraitementUniteLegale as DateDernierTraitement", "DenominationUniteLegale as Denomination", "DenominationUsuelle1UniteLegale as DenominationUsuelle1", "DenominationUsuelle2UniteLegale as DenominationUsuelle2", "DenominationUsuelle3UniteLegale as DenominationUsuelle3", "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire", "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active", "IdentifiantAssociationUniteLegale as IdentifiantAssociation", "NicSiegeUniteLegale as NicSiege", "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes", "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale", "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage", "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", "PrenomUsuelUniteLegale as PrenomUsuel", "PseudonymeUniteLegale as Pseudonyme", "SexeUniteLegale as Sexe", "SigleUniteLegale as Sigle", "Siren", "TrancheEffectifsUniteLegale as TrancheEffectifSalarie" ); {code} The _Dataset_ is successfully created. 
But the following call of _Encoders.bean(Enterprise.class)_ fails : {code:java} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178) at test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532) at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59) at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72) at
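For readers following the report, here is a minimal Scala sketch of the pattern being described: CSV columns renamed and cast with {{selectExpr}}, then mapped to a bean with {{Encoders.bean}}. The tiny stand-in bean and file path are assumptions; a bean this simple may well encode without hitting the assertion, so this only outlines the pattern and does not reproduce the failure.

{code:scala}
import org.apache.spark.sql.{Encoders, SparkSession}
import scala.beans.BeanProperty

// A stand-in for the reporter's Entreprise bean, with only two assumed fields.
class Entreprise {
  @BeanProperty var siren: String = _
  @BeanProperty var active: java.lang.Boolean = _
}

object BeanEncodingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bean-encoding").getOrCreate()

    val csv = spark.read
      .option("header", "true")
      .csv("StockUniteLegale.csv") // placeholder path
      .selectExpr(
        "Siren as siren",
        "CAST(EtatAdministratifUniteLegale == 'A' AS BOOLEAN) as active")

    // In the report, the assertion fires inside Encoders.bean itself (ExpressionEncoder.javaBean).
    val ds = csv.as(Encoders.bean(classOf[Entreprise]))
    ds.show(5)
  }
}
{code}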