[jira] [Updated] (SPARK-29647) Use Python 3.7 in GitHub Action to recover lint-python
[ https://issues.apache.org/jira/browse/SPARK-29647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-29647:
----------------------------------
    Component/s: PySpark

> Use Python 3.7 in GitHub Action to recover lint-python
> ------------------------------------------------------
>
> Key: SPARK-29647
> URL: https://issues.apache.org/jira/browse/SPARK-29647
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 2.4.5
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> `branch-2.4` seems incompatible with Python 3.7.
> Currently, GitHub Action on `branch-2.4` is broken.
> This issue aims to recover the GitHub Action `lint-python` first.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29647) Use Python 3.7 in GitHub Action to recover lint-python
Dongjoon Hyun created SPARK-29647:
-------------------------------------
Summary: Use Python 3.7 in GitHub Action to recover lint-python
Key: SPARK-29647
URL: https://issues.apache.org/jira/browse/SPARK-29647
Project: Spark
Issue Type: Bug
Components: Tests
Affects Versions: 2.4.5
Reporter: Dongjoon Hyun

`branch-2.4` seems incompatible with Python 3.7.
Currently, GitHub Action on `branch-2.4` is broken.
This issue aims to recover the GitHub Action `lint-python` first.
[jira] [Created] (SPARK-29646) Allow pyspark version name format `${versionNumber}-preview` in release script
Xingbo Jiang created SPARK-29646:
------------------------------------
Summary: Allow pyspark version name format `${versionNumber}-preview` in release script
Key: SPARK-29646
URL: https://issues.apache.org/jira/browse/SPARK-29646
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.0.0
Reporter: Xingbo Jiang

We shall allow the pyspark version name format `${versionNumber}-preview` in the release script, to support preview releases.
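The version-name format the ticket asks for can be sketched as a small regex check. This is only an illustration of the requested format; `is_valid_spark_version` and the exact pattern are hypothetical, not part of Spark's actual release script:

```python
import re

# Accept plain versions like "3.0.0" and preview versions like
# "3.0.0-preview" or "3.0.0-preview2" (a sketch of the format
# discussed in SPARK-29646, not the release script's real rule).
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+(-preview\d*)?$")

def is_valid_spark_version(name: str) -> bool:
    return VERSION_RE.fullmatch(name) is not None

print(is_valid_spark_version("3.0.0"))           # True
print(is_valid_spark_version("3.0.0-preview"))   # True
print(is_valid_spark_version("3.0.0-preview2"))  # True
print(is_valid_spark_version("3.0.0_preview"))   # False
```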
[jira] [Created] (SPARK-29645) ML add param RelativeError
zhengruifeng created SPARK-29645:
------------------------------------
Summary: ML add param RelativeError
Key: SPARK-29645
URL: https://issues.apache.org/jira/browse/SPARK-29645
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng

It makes sense to expose {{RelativeError}} to end users, since it controls both the precision and the memory overhead.
[QuantileDiscretizer|https://github.com/apache/spark/compare/master...zhengruifeng:add_relative_err?expand=1#diff-bf4cb764860f82d632ac0730e3d8c605] has already added this param, while the other algorithms have not yet.
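The precision side of that trade-off can be illustrated in plain Python: a relative error of ε means any element whose rank is within ε·n of the exact quantile's rank is an acceptable answer. This is only a sketch of the contract (not Spark's Greenwald-Khanna implementation); `rank_error` is a hypothetical helper:

```python
def rank_error(data, candidate, q):
    """Rank distance (as a fraction of n) between `candidate`
    and the exact q-quantile of `data`."""
    s = sorted(data)
    n = len(s)
    target_rank = q * (n - 1)
    cand_rank = s.index(candidate)
    return abs(cand_rank - target_rank) / n

data = list(range(100))
# With relativeError = 0.05, any element whose rank is within 5 ranks
# of the exact median rank is an acceptable answer for q = 0.5.
relative_error = 0.05
acceptable = [x for x in data if rank_error(data, x, 0.5) <= relative_error]
print(min(acceptable), max(acceptable))  # 45 54
```

A larger ε admits more candidates, which is what lets an approximate-quantile sketch keep fewer samples in memory.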
[jira] [Commented] (SPARK-29372) Codegen grows beyond 64 KB for more columns in case of SupportsScanColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-29372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962663#comment-16962663 ]

Razibul Hossain commented on SPARK-29372:
-----------------------------------------
I am facing the same kind of issue. Error details follow:

{code}
CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0" grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0" grows beyond 64 KB
  at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
  at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
  at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
  at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
  at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
  at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
  at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.doExecute(DataSourceV2ScanExec.scala:90)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.doExecute(DataSourceV2ScanExec.scala:90)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at
{code}
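The failure above comes from the JVM's hard limit of 64 KB of bytecode per method: when whole-stage codegen emits one huge `processNext()` over many columns, Janino cannot compile it. Spark's mitigation is to split generated code into smaller helper methods (or fall back to non-codegen execution). Below is a toy Python sketch of that splitting idea only; the names (`generate_projection`, `project_N`) are hypothetical and this is not Spark's actual code generator:

```python
def generate_projection(num_columns, chunk_size=100):
    """Generate source for a projection over many columns, split into
    helper functions so no single function grows too large -- the same
    idea Spark uses to stay under the JVM's 64 KB per-method limit."""
    helpers, calls = [], []
    for start in range(0, num_columns, chunk_size):
        cols = range(start, min(start + chunk_size, num_columns))
        body = "\n".join(f"    out[{i}] = row[{i}] + 1" for i in cols)
        helpers.append(f"def project_{start}(row, out):\n{body}")
        calls.append(f"    project_{start}(row, out)")
    main = "def project(row):\n    out = [None] * %d\n%s\n    return out" % (
        num_columns, "\n".join(calls))
    return "\n\n".join(helpers + [main])

namespace = {}
exec(generate_projection(250), namespace)  # three helpers of <= 100 columns each
print(namespace["project"](list(range(250)))[:3])  # [1, 2, 3]
```

Each helper stays bounded in size no matter how many columns the projection covers, which is exactly what keeps any single compiled method under the limit.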
[jira] [Comment Edited] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962658#comment-16962658 ]

Suchintak Patnaik edited comment on SPARK-29621 at 10/30/19 3:38 AM:
---------------------------------------------------------------------
[~hyukjin.kwon] As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, it should not allow referencing only the internal corrupt record column, right? Then how come df.filter(df._corrupt_record.isNotNull()).count() raises an error while df.filter(df._corrupt_record.isNotNull()).show() doesn't?

was (Author: patnaik):
[~gurwls223] It should not allow referencing only the internal corrupt record column, right? Then how come filter.count() raises an error while filter.show() doesn't?

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962658#comment-16962658 ]

Suchintak Patnaik commented on SPARK-29621:
-------------------------------------------
[~gurwls223] It should not allow referencing only the internal corrupt record column, right? Then how come filter.count() raises an error while filter.show() doesn't?

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Updated] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiv Prashant Sood updated SPARK-29644:
---------------------------------------
Description:
@maropu pointed out this issue during the [PR 25344|https://github.com/apache/spark/pull/25344] review discussion.

In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:

{code}
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
{code}

I don't see any reproducible issue, but this is clearly a problem that must be fixed.

was:
@maropu pointed out this issue in [PR 25344|https://github.com/apache/spark/pull/25344].

In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:

{code}
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
{code}

> ShortType is wrongly set as Int in JDBCUtils.scala
> --------------------------------------------------
>
> Key: SPARK-29644
> URL: https://issues.apache.org/jira/browse/SPARK-29644
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4, 3.0.0
> Reporter: Shiv Prashant Sood
> Priority: Minor
>
> @maropu pointed out this issue during the [PR 25344|https://github.com/apache/spark/pull/25344] review discussion.
> In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:
> {code}
> case ShortType =>
>   (stmt: PreparedStatement, row: Row, pos: Int) =>
>     stmt.setInt(pos + 1, row.getShort(pos))
> {code}
> I don't see any reproducible issue, but this is clearly a problem that must be fixed.
[jira] [Created] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
Shiv Prashant Sood created SPARK-29644:
------------------------------------------
Summary: ShortType is wrongly set as Int in JDBCUtils.scala
Key: SPARK-29644
URL: https://issues.apache.org/jira/browse/SPARK-29644
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Shiv Prashant Sood

@maropu pointed out this issue in [PR 25344|https://github.com/apache/spark/pull/25344].

In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:

{code}
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
{code}
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962588#comment-16962588 ]

Hyukjin Kwon commented on SPARK-27763:
--------------------------------------
+1 !!

> Port test cases from PostgreSQL to Spark SQL
> --------------------------------------------
>
> Key: SPARK-27763
> URL: https://issues.apache.org/jira/browse/SPARK-27763
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Xiao Li
> Assignee: Yuming Wang
> Priority: Major
>
> To improve the test coverage, we can port the regression tests from other popular open source projects to Spark SQL. PostgreSQL is one of the best SQL systems. Below are the links to the test cases and expected results.
> * Regression test cases: [https://github.com/postgres/postgres/tree/master/src/test/regress/sql]
> * Expected results: [https://github.com/postgres/postgres/tree/master/src/test/regress/expected]
> Spark SQL does not support all the feature sets of PostgreSQL. In the current stage, we should first comment out these test cases and create the corresponding JIRAs in SPARK-27764. We can discuss and prioritize which features we should support. Also, these PostgreSQL regression tests could expose existing bugs in Spark SQL. We should create JIRAs for those as well and track them in SPARK-27764.
[jira] [Updated] (SPARK-29620) UnsafeKVExternalSorterSuite failure on bigendian system
[ https://issues.apache.org/jira/browse/SPARK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29620:
---------------------------------
Description:
{code}
spark/sql/core# ../../build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite test
{code}

{code}
UnsafeKVExternalSorterSuite:
12:24:24.305 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
- kv sorting key schema [] and value schema [] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (4) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- kv sorting key schema [int] and value schema [] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- kv sorting key schema [] and value schema [int] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- kv sorting key schema [int] and value schema [float,float,double,string,float] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (2732) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at
{code}
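The assertion that fails in each test is UnsafeRow's requirement that its backing buffer size be 8-byte aligned; every reported size (4, 20, 2732) is not a multiple of 8, which suggests the big-endian build is computing unaligned sizes somewhere. The expected alignment arithmetic can be sketched in plain Python (illustrative only, not Spark code):

```python
def round_up_to_8(size_in_bytes: int) -> int:
    """Round a byte size up to the next multiple of 8 -- the alignment
    that UnsafeRow.pointTo asserts on."""
    return (size_in_bytes + 7) & ~7

# The sizes reported in the failures, and what an aligned size would be:
for size in (4, 20, 2732):
    print(size, "->", round_up_to_8(size))  # 4 -> 8, 20 -> 24, 2732 -> 2736
```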
[jira] [Commented] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962581#comment-16962581 ]

Huaxin Gao commented on SPARK-29643:
------------------------------------
I am working on this.

> ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
> --------------------------------------------------------------------------
>
> Key: SPARK-29643
> URL: https://issues.apache.org/jira/browse/SPARK-29643
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Environment: ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
> Reporter: Huaxin Gao
> Priority: Major
[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962580#comment-16962580 ]

Hyukjin Kwon commented on SPARK-29621:
--------------------------------------
Filter is fine; it doesn't have any problem.

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
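The reporter's puzzlement (count() fails, show() doesn't) is a general property of lazily evaluated pipelines: an error only surfaces when a consumer actually forces the part of the computation that triggers it. A plain-Python analogy with generators (only an analogy; in Spark the check happens against the particular query plan each action builds, not at element-consumption time):

```python
def records():
    # The second record is "corrupt": it is missing the "price" field.
    yield {"name": "apple", "price": 10}
    yield {"name": "banana"}

def prices():
    # Lazily parse records; a KeyError surfaces only when the bad
    # record is actually pulled by some consumer.
    for r in records():
        yield r["price"]

# A show()-like consumer looks only at the first few elements and
# never reaches the corrupt record:
first = next(iter(prices()))
print(first)  # 10

# A count()-like consumer forces the whole pipeline and hits the error:
error = None
try:
    total = sum(1 for _ in prices())
except KeyError as exc:
    error = exc
print(type(error).__name__)  # KeyError
```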
[jira] [Resolved] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-29621.
----------------------------------
    Resolution: Not A Problem

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Created] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
Huaxin Gao created SPARK-29643:
----------------------------------
Summary: ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
Key: SPARK-29643
URL: https://issues.apache.org/jira/browse/SPARK-29643
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Environment: ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
Reporter: Huaxin Gao
[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29621:
---------------------------------
Description:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it is allowed when querying only the internal corrupt record column in a *filter* operation.

{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record", StringType(), False),
    StructField("Name", StringType(), False),
    StructField("Colour", StringType(), True),
    StructField("Price", IntegerType(), True),
    StructField("Quantity", IntegerType(), True)])
df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
{code}

was:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it's allowing while querying only the internal corrupt record column in case of *filter* operation.

{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record",StringType(),False),
    StructField("Name",StringType(),False),
    StructField("Colour",StringType(),True),
    StructField("Price",IntegerType(),True),
    StructField("Quantity",IntegerType(),True)])
df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show() // Allowed
{code}

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29621:
---------------------------------
Description:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it's allowing while querying only the internal corrupt record column in case of *filter* operation.

{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record",StringType(),False),
    StructField("Name",StringType(),False),
    StructField("Colour",StringType(),True),
    StructField("Price",IntegerType(),True),
    StructField("Quantity",IntegerType(),True)])
df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show() // Allowed
{code}

was:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it's allowing while querying only the internal corrupt record column in case of *filter* operation.

from pyspark.sql.types import *
schema = StructType([StructField("_corrupt_record",StringType(),False),StructField("Name",StringType(),False),StructField("Colour",StringType(),True),StructField("Price",IntegerType(),True),StructField("Quantity",IntegerType(),True)])
df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show() // Allowed

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it's allowing while querying only the internal corrupt record column in case of *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record",StringType(),False),
>     StructField("Name",StringType(),False),
>     StructField("Colour",StringType(),True),
>     StructField("Price",IntegerType(),True),
>     StructField("Quantity",IntegerType(),True)])
> df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show() // Allowed
> {code}
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962576#comment-16962576 ] Hyukjin Kwon commented on SPARK-29625: -- [~sanysand...@gmail.com] Please don't just copy and paste the log and exception; describe your analysis and provide a self-contained reproducer. Otherwise, no one can investigate. > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structured Streaming Kafka resets offsets twice, once with the right > offsets and a second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at >
[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29625: - Description: Spark Structure Streaming Kafka Reset Offset twice, once with right offsets and second time with very old offsets {code} [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-19 to offset 0. 
[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 120504922.* [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO ContextCleaner: Cleaned accumulator 810 {code} which is causing a Data loss issue. {code} [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - java.lang.IllegalStateException: Partition topic-52's offset was changed from 122677598 to 120504922, some data may have been missed. [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have been lost because they are not available in Kafka any more; either the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out by Kafka or the topic may have been deleted before all the data in the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was processed. If you don't want your streaming query to fail on such cases, set the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option "failOnDataLoss" to "false". 
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.AbstractTraversable.filter(Traversable.scala:104) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:281) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at
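The error log above already names the escape hatch. A sketch of where that option goes on the reader (topic and broker address are placeholders, and an active SparkSession bound to `spark` is assumed); note that this setting skips the missed range rather than fixing the underlying offset regression:

```python
# Hypothetical Structured Streaming reader configuration; the topic and
# broker values are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "kafka-topic-name")
      # Warn instead of failing the query when offsets go backwards or data
      # ages out of Kafka; missed data is silently skipped.
      .option("failOnDataLoss", "false")
      .load())
```

Use with care: the query keeps running, but the records between the old and new offsets are never processed.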
[jira] [Resolved] (SPARK-29627) array_contains should allow column instances in PySpark
[ https://issues.apache.org/jira/browse/SPARK-29627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29627. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26288 [https://github.com/apache/spark/pull/26288] > array_contains should allow column instances in PySpark > --- > > Key: SPARK-29627 > URL: https://issues.apache.org/jira/browse/SPARK-29627 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > The Scala API works well with column instances: > {code} > import org.apache.spark.sql.functions._ > val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data") > df.select(array_contains($"data", lit("a"))).collect() > {code} > {code} > Array[org.apache.spark.sql.Row] = Array([true], [false]) > {code} > However, the PySpark one doesn't seem to: > {code} > from pyspark.sql.functions import array_contains, lit > df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data']) > df.select(array_contains(df.data, lit("a"))).show() > {code} > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/.../spark/python/pyspark/sql/functions.py", line 1950, in > array_contains > return Column(sc._jvm.functions.array_contains(_to_java_column(col), value)) > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1277, in __call__ > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1241, in _build_args > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1228, in _get_args > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", > line 500, in convert > File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__ > raise TypeError("Column is not iterable") > TypeError: Column is not iterable > {code} > We should allow it as well.
[jira] [Resolved] (SPARK-29629) Support typed integer literal expression
[ https://issues.apache.org/jira/browse/SPARK-29629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29629. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26291 [https://github.com/apache/spark/pull/26291] > Support typed integer literal expression > > > Key: SPARK-29629 > URL: https://issues.apache.org/jira/browse/SPARK-29629 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > postgres=# select date '2001-09-28' + integer '7'; > ?column? > > 2001-10-05 > (1 row)postgres=# select integer '7'; > int4 > -- > 7 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29629) Support typed integer literal expression
[ https://issues.apache.org/jira/browse/SPARK-29629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29629: Assignee: Kent Yao > Support typed integer literal expression > > > Key: SPARK-29629 > URL: https://issues.apache.org/jira/browse/SPARK-29629 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > {code:java} > postgres=# select date '2001-09-28' + integer '7'; > ?column? > > 2001-10-05 > (1 row)postgres=# select integer '7'; > int4 > -- > 7 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29638) Spark handles 'NaN' as 0 in sums
[ https://issues.apache.org/jira/browse/SPARK-29638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dylan Guedes updated SPARK-29638: - Description: Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = 'NaN' I experienced this with the query below: {code:sql} SELECT a, b, SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) FROM (VALUES(1,1),(2,2),(3,(cast('nan' as int))),(4,3),(5,4)) t(a,b); {code} was:Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = 'NaN' > Spark handles 'NaN' as 0 in sums > > > Key: SPARK-29638 > URL: https://issues.apache.org/jira/browse/SPARK-29638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. > PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = > 'NaN' > I experienced this with the query below: > {code:sql} > SELECT a, b, >SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) > FROM (VALUES(1,1),(2,2),(3,(cast('nan' as int))),(4,3),(5,4)) t(a,b); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
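For comparison, IEEE-754 floating-point arithmetic (which PostgreSQL follows for float columns) propagates NaN through a sum rather than treating it as 0. A plain-Python sketch of the behavior the report expects; note the quoted query casts 'nan' to int, a type that has no NaN value, so the floating-point case below is the meaningful one:

```python
import math

values = [1.0, 2.0, float("nan"), 3.0, 4.0]

# IEEE-754: any sum involving NaN is NaN, i.e. 3 + NaN = NaN.
total = sum(values)
assert math.isnan(total)

# Treating NaN as 0 (what the report says Spark's window sum does)
# yields a finite result instead.
total_ignoring_nan = sum(0.0 if math.isnan(v) else v for v in values)
print(total_ignoring_nan)  # 10.0
```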
[jira] [Issue Comment Deleted] (SPARK-29120) Add create_view.sql
[ https://issues.apache.org/jira/browse/SPARK-29120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-29120: -- Comment: was deleted (was: I will work on this.) > Add create_view.sql > --- > > Key: SPARK-29120 > URL: https://issues.apache.org/jira/browse/SPARK-29120 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29636) Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962542#comment-16962542 ] Aman Omer commented on SPARK-29636: --- Thanks [~DylanGuedes] . I am checking this one. > Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp > --- > > Key: SPARK-29636 > URL: https://issues.apache.org/jira/browse/SPARK-29636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark can't parse a string such as '11:00 BST' or '2000-10-19 > 10:23:54+01' to timestamp: > {code:sql} > spark-sql> select cast ('11:00 BST' as timestamp); > NULL > Time taken: 2.248 seconds, Fetched 1 row(s) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
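For context, '+01' is PostgreSQL's abbreviated UTC offset; most parsers only accept the full '+01:00' form, so supporting it is largely a normalization step ('11:00 BST' is harder, since zone abbreviations are ambiguous). A plain-Python sketch with a hypothetical `normalize` helper, not part of any Spark API:

```python
import re
from datetime import datetime

def normalize(ts: str) -> str:
    # Expand a bare "+HH"/"-HH" offset to "+HH:00" so ISO parsers accept it.
    return re.sub(r"([+-]\d{2})$", r"\1:00", ts)

ts = normalize("2000-10-19 10:23:54+01")  # -> "2000-10-19 10:23:54+01:00"
parsed = datetime.fromisoformat(ts)
print(parsed.utcoffset())  # 1:00:00
```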
[jira] [Resolved] (SPARK-28746) Add repartitionby hint to support RepartitionByExpression
[ https://issues.apache.org/jira/browse/SPARK-28746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28746. -- Fix Version/s: 3.0.0 Assignee: ulysses you Resolution: Fixed Resolved by https://github.com/apache/spark/pull/25464 > Add repartitionby hint to support RepartitionByExpression > - > > Key: SPARK-28746 > URL: https://issues.apache.org/jira/browse/SPARK-28746 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > Currently, `RepartitionByExpression` is available through the Dataset method > `Dataset.repartition()`, but Spark SQL has no equivalent functionality. > In Hive, we can use `distribute by`, so it's worth adding a hint to support > such a function. > Similar jira: https://issues.apache.org/jira/browse/SPARK-24940
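With the hint merged, the SQL-side equivalent of `Dataset.repartition()` can be sketched as below (table and column names are illustrative; the column-argument form targets Spark 3.0, where this ticket was fixed):

```sql
-- Roughly equivalent to df.repartition(3, $"dept") from SQL:
SELECT /*+ REPARTITION(3, dept) */ name, dept, salary
FROM employees;

-- Hive-style alternative mentioned in the description:
SELECT name, dept, salary FROM employees DISTRIBUTE BY dept;
```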
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962531#comment-16962531 ] Terry Kim commented on SPARK-29592: --- OK. Thanks. > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962527#comment-16962527 ] Ryan Blue commented on SPARK-29592: --- There is not currently a way to alter the partition spec for a table, so I don't think we need to worry about this for now. > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962522#comment-16962522 ] Terry Kim commented on SPARK-29592: --- + [~rdblue] as well (for setting partition spec) > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962515#comment-16962515 ] Takeshi Yamamuro commented on SPARK-27763: -- Thanks for the check, [~dongjoon] and [~yumwang]! OK, I'll file sub-tickets for the remaining ones later and make PRs. > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. These PostgreSQL regression tests could > also expose existing bugs in Spark SQL. We should also create the JIRAs > and track them in SPARK-27764.
[jira] [Resolved] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Choudhury resolved SPARK-29639. --- Resolution: Abandoned Resolved due to a lack of sufficient info and steps to reproduce > Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch > > > Key: SPARK-29639 > URL: https://issues.apache.org/jira/browse/SPARK-29639 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Abhinav Choudhury >Priority: Major > > We have been running a Spark Structured Streaming job in production for more > than a week now. Put simply, it reads data from source Kafka topics (with 4 > partitions) and writes to another Kafka topic. Everything had been running > fine until the job started failing with the following error: > > {noformat} > Driver stacktrace: > === Streaming Query === > Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId > = 613a21ad-86e3-4781-891b-17d92c18954a] > Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":9809564} > }} > Current Available Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":10509527} > }} > Current State: ACTIVE > Thread State: RUNNABLE > <-- Removed Logical plan --> > Some data may have been lost because they are not available in Kafka any > more; either the > data was aged out by Kafka or the topic may have been deleted before all the > data in the > topic was processed. 
If you don't want your streaming query to fail on such > cases, set the > source option "failOnDataLoss" to "false".{noformat} > Configuration: > {noformat} > Spark 2.4.0 > Spark-sql-kafka 0.10{noformat} > Looking at the Spark structured streaming query progress logs, it seems like > the endOffsets computed for the next batch was actually smaller than the > starting offset: > *Microbatch Trigger 1:* > {noformat} > 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:51.741Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 9, > "triggerExecution" : 9 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > *Next micro batch trigger:* > {noformat} > 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:52.907Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "addBatch" : 350, > "getBatch" : 4, > "getEndOffset" : 0, > "queryPlanning" : 102, > "setOffsetRange" : 24, > "triggerExecution" : 1043, > 
"walCommit" : 349 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 9773098, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > Notice that for partition 3 of the kafka topic, the endOffsets are actually > smaller than the starting offsets! > Checked the HDFS checkpoint dir and the checkpointed offsets look fine and > point to the last committed offsets > Why is the
[jira] [Created] (SPARK-29642) ContinuousMemoryStream throws error on String type
Jungtaek Lim created SPARK-29642: Summary: ContinuousMemoryStream throws error on String type Key: SPARK-29642 URL: https://issues.apache.org/jira/browse/SPARK-29642 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim While we can set String as a generic type of ContinuousMemoryStream, it doesn't really work, because it doesn't convert String to UTF8String, and accessing it through the Row interface throws an error. We should encode the input and convert it to Row properly.
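For background, Spark's internal row format stores strings as `UTF8String`, not `java.lang.String`, which is why the raw value blows up on access. A sketch of the conversion involved (`UTF8String` is an internal API and subject to change):

```scala
import org.apache.spark.unsafe.types.UTF8String

// What the memory stream would need to do before placing the value
// into an internal row:
val internal: UTF8String = UTF8String.fromString("hello")

// And the reverse when exposing it through the external Row interface:
val external: String = internal.toString
```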
[jira] [Created] (SPARK-29641) Stage Level Sched: Add python api's and test
Thomas Graves created SPARK-29641: - Summary: Stage Level Sched: Add python api's and test Key: SPARK-29641 URL: https://issues.apache.org/jira/browse/SPARK-29641 Project: Spark Issue Type: Story Components: PySpark Affects Versions: 3.0.0 Reporter: Thomas Graves For the Stage Level Scheduling feature, add any missing Python APIs and test them.
[jira] [Updated] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field
[ https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rob Russo updated SPARK-29497: -- Description: Note this is for scala 2.12: There seems to be an issue in spark with serializing a udf that is created from a function assigned to a class member that references another function assigned to a class member. This is similar to https://issues.apache.org/jira/browse/SPARK-25047 but it looks like the resolution has an issue with this case. After trimming it down to the base issue I came up with the following to reproduce: {code:java} object TestLambdaShell extends Serializable { val hello: String => String = s => s"hello $s!" val lambdaTest: String => String = hello( _ ) def functionTest: String => String = hello( _ ) } val hello = udf( TestLambdaShell.hello ) val functionTest = udf( TestLambdaShell.functionTest ) val lambdaTest = udf( TestLambdaShell.lambdaTest ) sc.parallelize(Seq("world"),1).toDF("test").select(hello($"test")).show(1) sc.parallelize(Seq("world"),1).toDF("test").select(functionTest($"test")).show(1) sc.parallelize(Seq("world"),1).toDF("test").select(lambdaTest($"test")).show(1) {code} All of which works except the last line which results in an exception on the executors: {code:java} Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$.lambdaTest of type scala.Function1 in instance of $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$ at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133) at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2136) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at
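Notably, in the reproduction above the {{def}}-based {{functionTest}} serializes fine while the {{val}}-based {{lambdaTest}} does not. That suggests a workaround sketch (an assumption drawn from the repro, not a confirmed fix): expose member lambdas that reference other members through a {{def}} rather than a {{val}}:

{code:scala}
// Workaround sketch, assuming the def/val difference in the repro above
// is the deciding factor: a def returns a fresh Function1 on each call
// instead of storing one in a field of the singleton, so deserialization
// never has to assign a SerializedLambda to TestLambdaShell$.lambdaTest.
object TestLambdaShellFixed extends Serializable {
  val hello: String => String = s => s"hello $s!"
  // was: val lambdaTest: String => String = hello( _ )
  def lambdaTest: String => String = hello( _ )
}

val lambdaTestUdf = udf( TestLambdaShellFixed.lambdaTest )
sc.parallelize(Seq("world"), 1).toDF("test").select(lambdaTestUdf($"test")).show(1)
{code}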
[jira] [Commented] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962361#comment-16962361 ] Michael Chen commented on SPARK-29561: -- If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. So that is ok. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again. > Large Case Statement Code Generation OOM > > > Key: SPARK-29561 > URL: https://issues.apache.org/jira/browse/SPARK-29561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michael Chen >Priority: Major > Attachments: apacheSparkCase.sql > > > Spark Configuration > spark.driver.memory = 1g > spark.master = "local" > spark.deploy.mode = "client" > Try to execute a case statement with 3000+ branches. Added SQL statement as > an attachment. > Spark runs for a while before it OOMs > {noformat} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) > at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) > 19/10/22 16:19:54 ERROR FileFormatWriter: Aborting job null. 
> java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.HashMap.newNode(HashMap.java:1750) > at java.util.HashMap.putVal(HashMap.java:631) > at java.util.HashMap.putMapEntries(HashMap.java:515) > at java.util.HashMap.putAll(HashMap.java:785) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3345) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3230) > at > org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3198) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:3351) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3254) > at org.codehaus.janino.UnitCompiler.access$3900(UnitCompiler.java:212) > at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3216) > at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3198) > at org.codehaus.janino.Java$Block.accept(Java.java:2756) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3260) > at org.codehaus.janino.UnitCompiler.access$4000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3217) > at > org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3198) > at org.codehaus.janino.Java$DoStatement.accept(Java.java:3304) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3186) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3009) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336) > at > 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958) > at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286) > 19/10/22 16:19:54 ERROR Utils: throw uncaught fatal error in thread Spark > Context Cleaner > java.lang.OutOfMemoryError: GC overhead limit exceeded > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) > at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at >
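Since the OOM occurs while Janino compiles the generated method for the huge CASE expression, one mitigation to try (a sketch only; it trades codegen throughput for compiler memory and does not address the root cause) is disabling whole-stage code generation for the session:

{code:scala}
// Mitigation sketch: fall back to the interpreted execution path so the
// 3000+-branch CASE statement is never handed to Janino for compilation.
// "largeCaseSql" is a placeholder for the statement in apacheSparkCase.sql.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql(largeCaseSql).show()
{code}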
[jira] [Comment Edited] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962361#comment-16962361 ] Michael Chen edited comment on SPARK-29561 at 10/29/19 7:59 PM: If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again. was (Author: mikechen): If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. So that is ok. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again.
[jira] [Comment Edited] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962361#comment-16962361 ] Michael Chen edited comment on SPARK-29561 at 10/29/19 7:59 PM: Yes it works when I increase the driver memory. It just runs into the generated code grows beyond 64 KB exception and then disables whole stage code generation for the plan. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again. was (Author: mikechen): If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again.
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962333#comment-16962333 ] Terry Kim commented on SPARK-29592: --- [~cloud_fan], is the spec for setting partitions finalized in the v2 catalog? For example, the location can be changed as follows: {code:scala} val changes = Seq(TableChange.setProperty("location", newLoc)) createAlterTable(nameParts, catalog, tableName, changes) {code} Can partitions also be changed via `TableChange`? > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29637. Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26295 [https://github.com/apache/spark/pull/26295] > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ does not include the description in the returned > JobData. Up to Spark 2.2 the description was included. > Steps to reproduce: > * Open spark-shell > {code:scala} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access the REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * The REST API of Spark 2.3 and above will not return the description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29637: -- Assignee: Gabor Somogyi > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29640) [K8S] Make it possible to set DNS option to use TCP instead of UDP
Andy Grove created SPARK-29640: -- Summary: [K8S] Make it possible to set DNS option to use TCP instead of UDP Key: SPARK-29640 URL: https://issues.apache.org/jira/browse/SPARK-29640 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.4 Reporter: Andy Grove Fix For: 2.4.5 We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc", which seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790] One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]). I would like the ability to provide a flag to spark-submit to specify TCP mode for DNS lookups. I am working on a PR for this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing
[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962255#comment-16962255 ] Steve Loughran commented on SPARK-27676: 1. Well, see if you can create a backport PR and get it through. 2. If you are working on AWS with an S3 store that does not have a consistency layer on top of it (consistent EMR, S3Guard) and you are relying on directory listings to enumerate the content of your tables, you are doomed. Seriously. In particular, if you're using the commit-by-rename committers, then task and job commit may miss newly created files. Which means that if you are seeing the problem discussed here, it may be a symptom of a larger issue. > InMemoryFileIndex should hard-fail on missing files instead of logging and > continuing > - > > Key: SPARK-27676 > URL: https://issues.apache.org/jira/browse/SPARK-27676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.0.0 > > > Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} > exceptions are caught and logged as warnings (during [directory > listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274] > and [block location > lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]). > I think that this is a dangerous default behavior and would prefer that > Spark hard-fails by default (with the ignore-and-continue behavior guarded by > a SQL session configuration). > In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. > Quoting from the PR for SPARK-17599: > {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. 
> If a folder is deleted at any time between the paths were resolved and the > file catalog can check for the folder, the Spark job fails. This may abruptly > stop long running StructuredStreaming jobs for example. > Folders may be deleted by users or automatically by retention policies. These > cases should not prevent jobs from successfully completing. > {quote} > Let's say that I'm *not* expecting to ever delete input files for my job. In > that case, this behavior can mask bugs. > One straightforward masked bug class is accidental file deletion: if I'm > never expecting to delete files then I'd prefer to fail my job if Spark sees > deleted files. > A more subtle bug can occur when using a S3 filesystem. Say I'm running a > Spark job against a partitioned Parquet dataset which is laid out like this: > {code:java} > data/ > date=1/ > region=west/ >0.parquet >1.parquet > region=east/ >0.parquet >1.parquet{code} > If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform > multiple rounds of file listing, first listing {{/data/date=1}} to discover > the partitions for that date, then listing within each partition to discover > the leaf files. Due to the eventual consistency of S3 ListObjects, it's > possible that the first listing will show the {{region=west}} and > {{region=east}} partitions existing and then the next-level listing fails to > return any for some of the directories (e.g. {{/data/date=1/}} returns files > but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A > due to ListObjects inconsistency). > If Spark propagated the {{FileNotFoundException}} and hard-failed in this > case then I'd be able to fail the job in this case where we _definitely_ know > that the S3 listing is inconsistent (failing here doesn't guard against _all_ > potential S3 list inconsistency issues (e.g. 
back-to-back listings which both > return a subset of the true set of objects), but I think it's still an > improvement to fail for the subset of cases that we _can_ detect even if > that's not a surefire failsafe against the more general problem). > Finally, I'm unsure if the original patch will have the desired effect: if a > file is deleted once a Spark job expects to read it then that can cause > problems at multiple layers, both in the driver (multiple rounds of file > listing) and in executors (if the deletion occurs after the construction of > the catalog but before the scheduling of the read tasks); I think the > original patch only resolved the problem for the driver (unless I'm missing > similar executor-side code specific to the original streaming use-case). > Given all of these reasons, I think that the "ignore potentially deleted
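The ignore-and-continue behavior the comment proposes guarding is already exposed as a session configuration; a minimal sketch of the toggle (config name per Spark's SQL options, defaults hedged by version):

{code:scala}
// Sketch: jobs that genuinely expect files to disappear (e.g. retention
// policies deleting aged-out partitions) opt in explicitly...
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// ...while jobs that never expect deletions leave it false, so a
// FileNotFoundException fails the job instead of silently dropping data.
val df = spark.read.parquet("/data/date=1/") // layout from the example above
{code}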
[jira] [Assigned] (SPARK-29607) Move static methods from CalendarInterval to IntervalUtils
[ https://issues.apache.org/jira/browse/SPARK-29607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29607: --- Assignee: Maxim Gekk > Move static methods from CalendarInterval to IntervalUtils > -- > > Key: SPARK-29607 > URL: https://issues.apache.org/jira/browse/SPARK-29607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Move static methods from the CalendarInterval class to the helper object > IntervalUtils. Need to rewrite Java code to Scala code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29607) Move static methods from CalendarInterval to IntervalUtils
[ https://issues.apache.org/jira/browse/SPARK-29607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29607. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26261 [https://github.com/apache/spark/pull/26261] > Move static methods from CalendarInterval to IntervalUtils > -- > > Key: SPARK-29607 > URL: https://issues.apache.org/jira/browse/SPARK-29607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Move static methods from the CalendarInterval class to the helper object > IntervalUtils. Need to rewrite Java code to Scala code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962248#comment-16962248 ] Dongjoon Hyun commented on SPARK-27763: --- +1, [~maropu]! > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28938) Move to supported OpenJDK docker image for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-28938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962243#comment-16962243 ] Dongjoon Hyun commented on SPARK-28938: --- Thank you, [~jkleckner]. It landed to `branch-2.4`, too. > Move to supported OpenJDK docker image for Kubernetes > - > > Key: SPARK-28938 > URL: https://issues.apache.org/jira/browse/SPARK-28938 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 > Environment: Kubernetes >Reporter: Rodney Aaron Stainback >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > Attachments: cve-spark-py.txt, cve-spark-r.txt, cve-spark.txt, > twistlock.txt > > > The current docker image used by Kubernetes > {code:java} > openjdk:8-alpine{code} > is not supported > [https://github.com/docker-library/docs/blob/master/openjdk/README.md#supported-tags-and-respective-dockerfile-links] > It was removed with this commit > [https://github.com/docker-library/openjdk/commit/3eb0351b208d739fac35345c85e3c6237c2114ec#diff-f95ffa3d134732c33f7b8368e099] > Quote from commit "4. no more OpenJDK 8 Alpine images (Alpine/musl is not > officially supported by the OpenJDK project, so this reflects that -- see > "Project Portola" for the Alpine porting efforts which I understand are still > in need of help)" > > Please move to a supported image for Kubernetes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962228#comment-16962228 ] Sandish Kumar HN commented on SPARK-29625: -- [~tdas] It would be very helpful if you could review this. > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > ** > *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634.* > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. 
> ** > *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > which is causing a Data loss issue. > > {{[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351]
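The failure above fires when the offset Kafka reports after a reset is older than the offset the query already committed (122677598 -> 120504922 for topic-52). A hypothetical pure-Python sketch of that sanity check — the function name, arguments, and `fail_on_data_loss` flag are illustrative, not Spark's actual `KafkaSource` API:

```python
# Illustrative sketch (not Spark source code): flag partitions whose
# fetched offset moved backwards relative to the committed offset, the
# condition that triggers the IllegalStateException in the log above.
def check_offsets(committed, fetched, fail_on_data_loss=True):
    """Return {partition: (committed, fetched)} for regressed partitions.

    `committed` and `fetched` map partition name -> offset, e.g.
    {"topic-52": 122677598}.
    """
    regressed = {}
    for partition, committed_offset in committed.items():
        fetched_offset = fetched.get(partition, committed_offset)
        if fetched_offset < committed_offset:
            regressed[partition] = (committed_offset, fetched_offset)
    if regressed and fail_on_data_loss:
        details = ", ".join(
            f"{p}: {old} -> {new}" for p, (old, new) in regressed.items()
        )
        raise RuntimeError(
            f"Partition offsets moved backwards ({details}); "
            "some data may have been missed."
        )
    return regressed
```

Setting `fail_on_data_loss=False` mirrors the `failOnDataLoss=false` source option mentioned in the error: the regression is reported but the query is allowed to continue.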
[jira] [Updated] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Choudhury updated SPARK-29639: -- Description: We have been running a Spark structured job on production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another kafka topic. Everything has been running fine until the job started failing with the following error: {noformat} Driver stacktrace: === Streaming Query === Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId = 613a21ad-86e3-4781-891b-17d92c18954a] Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":9809564} }} Current Available Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":10509527} }} Current State: ACTIVE Thread State: RUNNABLE <-- Removed Logical plan --> Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. 
If you don't want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false".{noformat} Configuration: {noformat} Spark 2.4.0 Spark-sql-kafka 0.10{noformat} Looking at the Spark structured streaming query progress logs, it seems like the endOffsets computed for the next batch was actually smaller than the starting offset: *Microbatch Trigger 1:* {noformat} 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:51.741Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 9, "triggerExecution" : 9 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} *Next micro batch trigger:* {noformat} 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:52.907Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "addBatch" : 350, "getBatch" : 4, "getEndOffset" : 0, "queryPlanning" : 102, "setOffsetRange" : 24, "triggerExecution" : 1043, "walCommit" : 349 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : 
{ "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 9773098, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} Notice that for partition 3 of the kafka topic, the endOffsets are actually smaller than the starting offsets! Checked the HDFS checkpoint dir and the checkpointed offsets look fine and point to the last committed offsets Why is the end offset for a partition being computed to a smaller value? was: We have been running a Spark structured job on production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another kafka topic. Everything has been running fine until the job started failing with the following error: {noformat} Driver stacktrace: === Streaming Query === Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId = 613a21ad-86e3-4781-891b-17d92c18954a] Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":9809564} }} Current Available Offsets: {KafkaV2[Subscribe[kf-adsk-inquirystep]]: {"kf-adsk-inquirystep":
[jira] [Updated] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Choudhury updated SPARK-29639: -- Target Version/s: 2.4.4, 2.4.5 > Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch > > > Key: SPARK-29639 > URL: https://issues.apache.org/jira/browse/SPARK-29639 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Abhinav Choudhury >Priority: Major > > We have been running a Spark structured job on production for more than a > week now. Put simply, it reads data from source Kafka topics (with 4 > partitions) and writes to another kafka topic. Everything has been running > fine until the job started failing with the following error: > > {noformat} > Driver stacktrace: > === Streaming Query === > Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId > = 613a21ad-86e3-4781-891b-17d92c18954a] > Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":9809564} > }} > Current Available Offsets: {KafkaV2[Subscribe[kf-adsk-inquirystep]]: > {"kf-adsk-inquirystep": > {"2":10458347,"1":10460151,"3":10475678,"0":10509527} > }} > Current State: ACTIVE > Thread State: RUNNABLE > <-- Removed Logical plan --> > Some data may have been lost because they are not available in Kafka any > more; either the > data was aged out by Kafka or the topic may have been deleted before all the > data in the > topic was processed. 
If you don't want your streaming query to fail on such > cases, set the > source option "failOnDataLoss" to "false".{noformat} > Configuration: > {noformat} > Spark 2.4.0 > Spark-sql-kafka 0.10{noformat} > Looking at the Spark structured streaming query progress logs, it seems like > the endOffsets computed for the next batch was actually smaller than the > starting offset: > *Microbatch Trigger 1:* > {noformat} > 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:51.741Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 9, > "triggerExecution" : 9 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > *Next micro batch trigger:* > {noformat} > 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:52.907Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "addBatch" : 350, > "getBatch" : 4, > "getEndOffset" : 0, > "queryPlanning" : 102, > "setOffsetRange" : 24, > "triggerExecution" : 1043, > 
"walCommit" : 349 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 9773098, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > Notice that for partition 3 of the kafka topic, the endOffsets are actually > smaller than the starting offsets! > Checked the HDFS checkpoint dir and the checkpointed offsets look fine and > point to the last committed offsets > Why is the end offset for a partition being computed to a
[jira] [Created] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
Abhinav Choudhury created SPARK-29639: - Summary: Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch Key: SPARK-29639 URL: https://issues.apache.org/jira/browse/SPARK-29639 Project: Spark Issue Type: Bug Components: Input/Output, Structured Streaming Affects Versions: 2.4.0 Reporter: Abhinav Choudhury We have been running a Spark structured job on production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another kafka topic. Everything has been running fine until the job started failing with the following error: {noformat} Driver stacktrace: === Streaming Query === Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId = 613a21ad-86e3-4781-891b-17d92c18954a] Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":9809564} }} Current Available Offsets: {KafkaV2[Subscribe[kf-adsk-inquirystep]]: {"kf-adsk-inquirystep": {"2":10458347,"1":10460151,"3":10475678,"0":10509527} }} Current State: ACTIVE Thread State: RUNNABLE <-- Removed Logical plan --> Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. 
If you don't want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false".{noformat} Configuration: {noformat} Spark 2.4.0 Spark-sql-kafka 0.10{noformat} Looking at the Spark structured streaming query progress logs, it seems like the endOffsets computed for the next batch was actually smaller than the starting offset: *Microbatch Trigger 1:* {noformat} 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:51.741Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 9, "triggerExecution" : 9 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} *Next micro batch trigger:* {noformat} 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:52.907Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "addBatch" : 350, "getBatch" : 4, "getEndOffset" : 0, "queryPlanning" : 102, "setOffsetRange" : 24, "triggerExecution" : 1043, "walCommit" : 349 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : 
{ "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 9773098, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} Notice that for partition 3 of the kafka topic, the endOffsets are actually smaller than the starting offsets! Checked the HDFS checkpoint dir and the checkpointed offsets look fine and point to the last committed offsets Why is the end offset for a partition being computed to a smaller value? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28938) Move to supported OpenJDK docker image for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-28938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962131#comment-16962131 ] Jim Kleckner commented on SPARK-28938: -- I created a minor one-line patch fix for this here. Let me know if it needs a different story. [https://github.com/apache/spark/pull/26296] > Move to supported OpenJDK docker image for Kubernetes > - > > Key: SPARK-28938 > URL: https://issues.apache.org/jira/browse/SPARK-28938 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 > Environment: Kubernetes >Reporter: Rodney Aaron Stainback >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > Attachments: cve-spark-py.txt, cve-spark-r.txt, cve-spark.txt, > twistlock.txt > > > The current docker image used by Kubernetes > {code:java} > openjdk:8-alpine{code} > is not supported > [https://github.com/docker-library/docs/blob/master/openjdk/README.md#supported-tags-and-respective-dockerfile-links] > It was removed with this commit > [https://github.com/docker-library/openjdk/commit/3eb0351b208d739fac35345c85e3c6237c2114ec#diff-f95ffa3d134732c33f7b8368e099] > Quote from commit "4. no more OpenJDK 8 Alpine images (Alpine/musl is not > officially supported by the OpenJDK project, so this reflects that -- see > "Project Portola" for the Alpine porting efforts which I understand are still > in need of help)" > > Please move to a supported image for Kubernetes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962096#comment-16962096 ] Aman Omer commented on SPARK-29595: --- Thanks [~srowen] . Kindly check PR [https://github.com/apache/spark/pull/26293] . > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
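The by-name matching the issue asks for can be sketched outside Spark: instead of binding `named_struct("b", 3, "a", 1)` positionally onto schema `(a, b)`, reorder the pairs by field name. Pure-Python illustration; the function is hypothetical, not Spark code:

```python
# Illustrative sketch of by-name struct field matching: reorder a
# named_struct's (field, value) pairs to the target schema's field
# order rather than binding them by position.
def match_by_name(target_fields, named_struct_pairs):
    values = dict(named_struct_pairs)
    missing = [f for f in target_fields if f not in values]
    if missing:
        raise ValueError(f"missing struct fields: {missing}")
    return {f: values[f] for f in target_fields}
```

With this matching, inserting `named_struct("b", 3, "a", 1)` into schema `(a, b)` yields `{"a": 1, "b": 3}`, the result the issue says Spark should produce.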
[jira] [Issue Comment Deleted] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-29595: -- Comment: was deleted (was: Raised a PR [https://github.com/apache/spark/pull/26293] . Kindly review. Thanks.) > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))
[ https://issues.apache.org/jira/browse/SPARK-28478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962029#comment-16962029 ] David Vrba commented on SPARK-28478: If no one is working on this, i will take this one. > Optimizer rule to remove unnecessary explicit null checks for null-intolerant > expressions (e.g. if(x is null, x, f(x))) > --- > > Key: SPARK-28478 > URL: https://issues.apache.org/jira/browse/SPARK-28478 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > I ran across a family of expressions like > {code:java} > if(x is null, x, substring(x, 0, 1024)){code} > or > {code:java} > when($"x".isNull, $"x", substring($"x", 0, 1024)){code} > that were written this way because the query author was unsure about whether > {{substring}} would return {{null}} when its input string argument is null. > This explicit null-handling is unnecessary and adds bloat to the generated > code, especially if it's done via a {{CASE}} statement (which compiles down > to a {{do-while}} loop). > In another case I saw a query compiler which automatically generated this > type of code. > It would be cool if Spark could automatically optimize such queries to remove > these redundant null checks. Here's a sketch of what such a rule might look > like (assuming that SPARK-28477 has been implement so we only need to worry > about the {{IF}} case): > * In the pattern match, check the following three conditions in the > following order (to benefit from short-circuiting) > ** The {{IF}} condition is an explicit null-check of a column {{c}} > ** The {{true}} expression returns either {{c}} or {{null}} > ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as > a _direct_ child. > * If this condition matches, replace the entire {{If}} with the {{false}} > branch's expression.. 
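The three-step pattern match sketched in the issue can be modeled on a toy expression tree. This is a hypothetical illustration (plain Python dataclasses, not Catalyst), with `Substring` standing in for any null-intolerant expression f(c):

```python
# Toy expression tree sketching the proposed rule:
# If(IsNull(c), c-or-null, f(c)) -> f(c) when f is null-intolerant,
# since a null-intolerant f(c) already returns null for a null c.
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class IsNull:
    child: object

@dataclass(frozen=True)
class Null:
    pass

@dataclass(frozen=True)
class Substring:  # stand-in for any null-intolerant expression f(c)
    child: object
    start: int
    length: int

@dataclass(frozen=True)
class If:
    predicate: object
    true_branch: object
    false_branch: object

NULL_INTOLERANT = (Substring,)

def eliminate_redundant_null_check(expr):
    """Rewrite If(IsNull(c), c|null, f(c)) -> f(c); else return expr unchanged."""
    if (
        isinstance(expr, If)
        # 1. the IF condition is an explicit null check of a column c
        and isinstance(expr.predicate, IsNull)
        and isinstance(expr.predicate.child, Column)
        # 2. the true branch returns either c itself or null
        and (expr.true_branch == expr.predicate.child
             or isinstance(expr.true_branch, Null))
        # 3. the false branch is null-intolerant with c as a direct child
        and isinstance(expr.false_branch, NULL_INTOLERANT)
        and expr.false_branch.child == expr.predicate.child
    ):
        return expr.false_branch
    return expr
```

The conditions are checked in the same order as in the issue, so the cheap structural tests short-circuit before the more specific ones.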
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962021#comment-16962021 ] Yuming Wang commented on SPARK-27763: - +1 for porting: {noformat} alter_table.sql create_table.sql groupingsets.sql insert.sql limit.sql {noformat} > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian
[ https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-26985: - Fix Version/s: 2.4.5 > Test "access only some column of the all of columns " fails on big endian > - > > Key: SPARK-26985 > URL: https://issues.apache.org/jira/browse/SPARK-26985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Linux Ubuntu 16.04 > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed > References 20190205_218 (JIT enabled, AOT enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) > >Reporter: Anuja Jakhade >Assignee: ketan kunde >Priority: Major > Labels: BigEndian, correctness > Fix For: 2.4.5, 3.0.0 > > Attachments: DataFrameTungstenSuite.txt, > InMemoryColumnarQuerySuite.txt, access only some column of the all of > columns.txt > > > While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am > observing test failures for 2 Suites of Project SQL. > 1. InMemoryColumnarQuerySuite > 2. DataFrameTungstenSuite > In both the cases test "access only some column of the all of columns" fails > due to mismatch in the final assert. > Observed that the data obtained after df.cache() is causing the error. Please > find attached the log with the details. > cache() works perfectly fine if double and float values are not in picture. > Inside test !!- access only some column of the all of columns *** FAILED > *** -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29628) Forcibly create a temporary view in CREATE VIEW if referencing a temporary view
[ https://issues.apache.org/jira/browse/SPARK-29628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961973#comment-16961973 ] Takeshi Yamamuro commented on SPARK-29628: -- Thanks! > Forcibly create a temporary view in CREATE VIEW if referencing a temporary > view > --- > > Key: SPARK-29628 > URL: https://issues.apache.org/jira/browse/SPARK-29628 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > {code} > CREATE TEMPORARY VIEW temp_table AS SELECT * FROM VALUES > (1, 1) as temp_table(a, id); > CREATE VIEW v1_temp AS SELECT * FROM temp_table; > // In Spark > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v1_temp` by referencing a temporary > view `temp_table`; > // In PostgreSQL > NOTICE: view "v1_temp" will be a temporary view > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961970#comment-16961970 ] Takeshi Yamamuro commented on SPARK-27763: -- Hi, all, I've checked the other left (not-merged-yet) regressions tests in PgSQL: [https://github.com/postgres/postgres/tree/REL_12_STABLE/src/test/regress/sql] In my opinion, there are more five tests (see a list in the bottom) that we need to check for porting, because they are more basic ones than the others. If there is no problem for that, I'll check the left five tests (make PRs if necessary) then resolve this parent issue as resolved. WDYT? [~smilegator] [~dongjoon] [~hyukjin.kwon] [~yumwang] {code:java} -- the left regression tests we need to check alter_table.sql create_table.sql groupingsets.sql insert.sql limit.sql -- the regression tests we don't port now advisory_lock.sql alter_generic.sql alter_operator.sql amutils.sql arrays.sql async.sql bit.sql bitmapops.sql box.sql brin.sql btree_index.sql char.sql circle.sql cluster.sql collate.icu.utf8.sql collate.linux.utf8.sql collate.sql combocid.sql conversion.sql copy2.sql copydml.sql copyselect.sql create_aggregate.sql create_am.sql create_cast.sql create_function_3.sql create_index.sql create_index_spgist.sql create_misc.sql create_operator.sql create_procedure.sql create_table_like.sql create_type.sql dbsize.sql delete.sql dependency.sql domain.sql drop_if_exists.sql drop_operator.sql enum.sql equivclass.sql errors.sql event_trigger.sql expressions.sql fast_default.sql foreign_data.sql foreign_key.sql functional_deps.sql generated.sql geometry.sql gin.sql gist.sql guc.sql hash_func.sql hash_index.sql hash_part.sql horology.sql hs_primary_extremes.sql hs_primary_setup.sql hs_standby_allowed.sql hs_standby_check.sql hs_standby_disallowed.sql hs_standby_functions.sql identity.sql index_including.sql index_including_gist.sql indexing.sql indirect_toast.sql inet.sql inherit.sql init_privs.sql insert_conflict.sql 
join_hash.sql json.sql json_encoding.sql jsonb.sql jsonb_jsonpath.sql jsonpath.sql jsonpath_encoding.sql line.sql lock.sql lseg.sql macaddr.sql macaddr8.sql matview.sql misc_functions.sql misc_sanity.sql money.sql name.sql namespace.sql numeric_big.sql numerology.sql object_address.sql oid.sql oidjoins.sql opr_sanity.sql partition_aggregate.sql partition_info.sql partition_join.sql partition_prune.sql password.sql path.sql pg_lsn.sql pg_regress.list plancache.sql plpgsql.sql point.sql polygon.sql polymorphism.sql portals.sql portals_p2.sql prepare.sql prepared_xacts.sql privileges.sql psql.sql psql_crosstab.sql publication.sql random.sql rangefuncs.sql rangetypes.sql regex.linux.utf8.sql regex.sql regproc.sql reindex_catalog.sql reloptions.sql replica_identity.sql returning.sql roleattributes.sql rowsecurity.sql rowtypes.sql rules.sql sanity_check.sql security_label.sql select_distinct_on.sql select_into.sql select_parallel.sql select_views.sql sequence.sql spgist.sql stats.sql stats_ext.sql subscription.sql subselect.sql sysviews.sql tablesample.sql temp.sql tidscan.sql time.sql timestamptz.sql timetz.sql transactions.sql triggers.sql truncate.sql tsdicts.sql tsearch.sql tsrf.sql tstypes.sql txid.sql type_sanity.sql typed_table.sql updatable_views.sql update.sql uuid.sql vacuum.sql varchar.sql write_parallel.sql xml.sql xmlmap.sql {code} > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. 
> * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961964#comment-16961964 ] Sean R. Owen commented on SPARK-29595: -- As I recall that behavior is intended to match SQL semantics? I don't recall the details of the discussion. > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29451) Some queries with divisions in SQL windows are failing in Thrift
[ https://issues.apache.org/jira/browse/SPARK-29451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29451: - Parent: SPARK-27764 Issue Type: Sub-task (was: Bug) > Some queries with divisions in SQL windows are failling in Thrift > - > > Key: SPARK-29451 > URL: https://issues.apache.org/jira/browse/SPARK-29451 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Hello, > the following queries are not properly working on Thrift. The only difference > between them and some other queries that works fine are the numeric > divisions, I think. > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29451) Some queries with divisions in SQL windows are failing in Thrift
[ https://issues.apache.org/jira/browse/SPARK-29451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29451: - Parent: (was: SPARK-27763) Issue Type: Bug (was: Sub-task) > Some queries with divisions in SQL windows are failling in Thrift > - > > Key: SPARK-29451 > URL: https://issues.apache.org/jira/browse/SPARK-29451 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Hello, > the following queries are not properly working on Thrift. The only difference > between them and some other queries that works fine are the numeric > divisions, I think. > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing
[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961899#comment-16961899 ] siddartha sagar chinne commented on SPARK-27676: hi [~ste...@apache.org] can this fix be made available in spark 2.x. it would really help us running spark on aws until spark 3 is released. > InMemoryFileIndex should hard-fail on missing files instead of logging and > continuing > - > > Key: SPARK-27676 > URL: https://issues.apache.org/jira/browse/SPARK-27676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.0.0 > > > Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} > exceptions are caught and logged as warnings (during [directory > listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274] > and [block location > lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]). > I think that this is a dangerous default behavior and would prefer that > Spark hard-fails by default (with the ignore-and-continue behavior guarded by > a SQL session configuration). > In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. > Quoting from the PR for SPARK-17599: > {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. > If a folder is deleted at any time between the paths were resolved and the > file catalog can check for the folder, the Spark job fails. This may abruptly > stop long running StructuredStreaming jobs for example. > Folders may be deleted by users or automatically by retention policies. These > cases should not prevent jobs from successfully completing. 
> {quote} > Let's say that I'm *not* expecting to ever delete input files for my job. In > that case, this behavior can mask bugs. > One straightforward masked bug class is accidental file deletion: if I'm > never expecting to delete files then I'd prefer to fail my job if Spark sees > deleted files. > A more subtle bug can occur when using a S3 filesystem. Say I'm running a > Spark job against a partitioned Parquet dataset which is laid out like this: > {code:java} > data/ > date=1/ > region=west/ >0.parquet >1.parquet > region=east/ >0.parquet >1.parquet{code} > If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform > multiple rounds of file listing, first listing {{/data/date=1}} to discover > the partitions for that date, then listing within each partition to discover > the leaf files. Due to the eventual consistency of S3 ListObjects, it's > possible that the first listing will show the {{region=west}} and > {{region=east}} partitions existing and then the next-level listing fails to > return any for some of the directories (e.g. {{/data/date=1/}} returns files > but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A > due to ListObjects inconsistency). > If Spark propagated the {{FileNotFoundException}} and hard-failed in this > case then I'd be able to fail the job in this case where we _definitely_ know > that the S3 listing is inconsistent (failing here doesn't guard against _all_ > potential S3 list inconsistency issues (e.g. back-to-back listings which both > return a subset of the true set of objects), but I think it's still an > improvement to fail for the subset of cases that we _can_ detect even if > that's not a surefire failsafe against the more general problem). 
> Finally, I'm unsure if the original patch will have the desired effect: if a > file is deleted once a Spark job expects to read it then that can cause > problems at multiple layers, both in the driver (multiple rounds of file > listing) and in executors (if the deletion occurs after the construction of > the catalog but before the scheduling of the read tasks); I think the > original patch only resolved the problem for the driver (unless I'm missing > similar executor-side code specific to the original streaming use-case). > Given all of these reasons, I think that the "ignore potentially deleted > files during file index listing" behavior should be guarded behind a feature > flag which defaults to {{false}}, consistent with the existing > {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} > flags (which both default to false). -- This message was sent by Atlassian Jira (v8.3.4#803005)
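The proposal above amounts to guarding ignore-and-continue behind a flag that defaults to hard failure. A plain-Python sketch of that shape (hypothetical helper, not Spark's `InMemoryFileIndex` API):

```python
# Sketch of the proposed default: a listing step that propagates
# FileNotFoundError unless the caller explicitly opts in to logging
# and skipping missing paths.

import logging

def list_leaf_files(paths, lister, ignore_missing_files=False):
    files = []
    for path in paths:
        try:
            files.extend(lister(path))
        except FileNotFoundError:
            if not ignore_missing_files:
                raise          # hard-fail by default, as the issue proposes
            logging.warning("Skipping missing path: %s", path)
    return files

def fake_lister(path):
    # Simulates an S3-style listing where one partition directory
    # "disappears" between listing rounds (ListObjects inconsistency).
    listings = {"/data/date=1/region=east/": ["0.parquet", "1.parquet"]}
    if path not in listings:
        raise FileNotFoundError(path)
    return listings[path]

paths = ["/data/date=1/region=east/", "/data/date=1/region=west/"]
try:
    list_leaf_files(paths, fake_lister)
    failed = False
except FileNotFoundError:
    failed = True
assert failed  # default behavior: surface the inconsistency
assert list_leaf_files(paths, fake_lister, ignore_missing_files=True) == [
    "0.parquet", "1.parquet"]
```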
[jira] [Commented] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961872#comment-16961872 ] Gabor Somogyi commented on SPARK-29637: --- I'm working on this. > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ is not including description in the JobData > returned. This is not the case until Spark 2.2. > Steps to reproduce: > * Open spark-shell > {code:java} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access end REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29638) Spark handles 'NaN' as 0 in sums
Dylan Guedes created SPARK-29638: Summary: Spark handles 'NaN' as 0 in sums Key: SPARK-29638 URL: https://issues.apache.org/jira/browse/SPARK-29638 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Dylan Guedes Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = 'NaN' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
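The two semantics contrasted in the report can be sketched in plain Python: treating NaN as 0 amounts to skipping it when summing, while PostgreSQL lets NaN poison the whole aggregate.

```python
# Sketch of the two aggregation semantics described above: the reported
# Spark behavior effectively skips NaN (3 + NaN = 3), while PostgreSQL
# propagates it (3 + NaN = NaN).

import math

def sum_skip_nan(values):
    return sum(v for v in values if not math.isnan(v))

def sum_propagate_nan(values):
    total = 0.0
    for v in values:
        total += v          # any NaN poisons the running total
    return total

values = [3.0, float("nan")]
assert sum_skip_nan(values) == 3.0              # reported Spark behavior
assert math.isnan(sum_propagate_nan(values))    # PostgreSQL behavior
```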
[jira] [Updated] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-29637: -- Description: Starting from Spark 2.3, the SHS REST API endpoint /applications//jobs/ is not including description in the JobData returned. This is not the case until Spark 2.2. Steps to reproduce: * Open spark-shell {code:java} scala> sc.setJobGroup("test", "job", false); scala> val foo = sc.textFile("/user/foo.txt"); foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at :24 scala> foo.foreach(println); {code} * Access end REST API [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ * REST API of Spark 2.3 and above will not return description was: Starting from Spark 2.3, the SHS REST API endpoint /applications//jobs/ is not including description in the JobData returned. This is not the case until Spark 2.2. Steps to reproduce: * Open spark-shell {code:java} scala> sc.setJobGroup("test", "job", false); scala> val foo = sc.textFile("/user/foo.txt"); thorn: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at :24 scala> foo.foreach(println); {code} * Access end REST API http://SHS-host:port/api/v1/applications//jobs/ * REST API of Spark 2.3 and above will not return description > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ is not including description in the JobData > returned. This is not the case until Spark 2.2. 
> Steps to reproduce: > * Open spark-shell > {code:java} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access end REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-29637: -- Issue Type: Bug (was: Improvement) > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ is not including description in the JobData > returned. This is not the case until Spark 2.2. > Steps to reproduce: > * Open spark-shell > {code:java} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access end REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
Gabor Somogyi created SPARK-29637: - Summary: SHS Endpoint /applications//jobs/ doesn't include description Key: SPARK-29637 URL: https://issues.apache.org/jira/browse/SPARK-29637 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4, 2.3.4, 3.0.0 Reporter: Gabor Somogyi Starting from Spark 2.3, the SHS REST API endpoint /applications//jobs/ is not including description in the JobData returned. This is not the case until Spark 2.2. Steps to reproduce: * Open spark-shell {code:java} scala> sc.setJobGroup("test", "job", false); scala> val foo = sc.textFile("/user/foo.txt"); thorn: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at :24 scala> foo.foreach(println); {code} * Access end REST API http://SHS-host:port/api/v1/applications//jobs/ * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29636) Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp
Dylan Guedes created SPARK-29636: Summary: Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp Key: SPARK-29636 URL: https://issues.apache.org/jira/browse/SPARK-29636 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Dylan Guedes Currently, Spark can't parse a string such as '11:00 BST' or '2000-10-19 10:23:54+01' to timestamp: {code:sql} spark-sql> select cast ('11:00 BST' as timestamp); NULL Time taken: 2.248 seconds, Fetched 1 row(s) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
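Of the two formats, `'2000-10-19 10:23:54+01'` uses a short numeric zone offset. A plain-Python sketch (not Spark's parser) showing one way to handle it, by padding the offset so `strptime`'s `%z` accepts it; the abbreviation form `'11:00 BST'` is harder, since zone abbreviations like BST are ambiguous and need an explicit mapping to an offset:

```python
# Sketch: parse a timestamp with a short numeric offset ('+01') by
# normalizing it to '+0100', which strptime's %z directive accepts.

from datetime import datetime, timedelta
import re

def parse_with_short_offset(text):
    # Pad a trailing +HH / -HH offset out to +HHMM / -HHMM.
    normalized = re.sub(r"([+-]\d{2})$", r"\g<1>00", text)
    return datetime.strptime(normalized, "%Y-%m-%d %H:%M:%S%z")

ts = parse_with_short_offset("2000-10-19 10:23:54+01")
assert ts.utcoffset() == timedelta(hours=1)
```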
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961857#comment-16961857 ] Aman Omer commented on SPARK-29595: --- Raised a PR [https://github.com/apache/spark/pull/26293] . Kindly review. Thanks. > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29634) spark-sql can't query hive table values with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961828#comment-16961828 ] Zhaoyang Qin edited comment on SPARK-29634 at 10/29/19 9:42 AM: About Char datatype,Hive Docs sees: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-CharcharChar] But spark-sql CLI read this table by *select * from foo where bar = 'something'* , There will be no results.You must add Spaces like this: *select * from foo where bar = 'something '.* This causes normal SQL fail to execute correctly. And many TPCDS statements get no results,eg: sql7,sql71,sql 72 etc. was (Author: kelvin.fe): About Char datatype,Hive Docs sees: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-CharcharChar] But spark-sql CLI read this table by *select * from foo where bar = 'something'* ,There will be no results.You must add Spaces like this: *select * from foo where bar = 'something '.* This causes normal SQL fail to execute correctly. And many TPCDS statements to get no results.eg: sql7,sql71,sql 72 etc. > spark-sql can't query hive table values with schema Char by equality. > - > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. 
Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. > Simulation steps:(use spark-sql) > {code:java} > //spark-sql > {code} > ./spark-sql > > create table test1 (name char(10), age int); > > insert into test1 values('erya',15); > > select * from test1 where name = 'erya'; // no results > > select * from test1 where name = 'erya '; // add 6 spaces -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
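The behavior described above follows from CHAR's fixed-length padding: the stored value is space-padded to the declared length, and Hive's comparison ignores those trailing spaces while the reported spark-sql behavior compares the padded bytes directly. A plain-Python sketch of the two semantics:

```python
# Sketch of CHAR(10) comparison semantics: the stored value is
# space-padded to the declared length; a CHAR-aware comparison strips
# trailing spaces (Hive's behavior), while a naive binary comparison
# (the behavior reported for spark-sql) requires the padding to match.

def char_store(value, length):
    return value.ljust(length)              # 'erya' -> 'erya      '

def char_equals(a, b):
    return a.rstrip(" ") == b.rstrip(" ")   # trailing spaces insignificant

stored = char_store("erya", 10)
assert stored != "erya"                 # naive equality: no match
assert stored == "erya" + " " * 6       # matches only with the padding spelled out
assert char_equals(stored, "erya")      # CHAR semantics: match
```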
[jira] [Updated] (SPARK-29634) spark-sql can't query hive table values with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaoyang Qin updated SPARK-29634: - Summary: spark-sql can't query hive table values with schema Char by equality. (was: spark-sql can't query hive table values that with schema Char by equality.) > spark-sql can't query hive table values with schema Char by equality. > - > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. > Simulation steps:(use spark-sql) > {code:java} > //spark-sql > {code} > ./spark-sql > > create table test1 (name char(10), age int); > > insert into test1 values('erya',15); > > select * from test1 where name = 'erya'; // no results > > select * from test1 where name = 'erya '; // add 6 spaces -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaoyang Qin updated SPARK-29634: - Description: spark-sql can't query hive table values that with schema Char by equality. When i use spark-sql CLI to execute a query with hive tables,The expected results can not be obtained. The query result is empty. Some equality conditions did not work as expected.I checked and found that the table fields had one thing in common: they were created as char,sometimes as varchar.Then I execute the following statement and return an empty result: select * from foo where bar = 'something'.In fact, the data does exist. Using hive sql returns the correct results. Simulation steps:(use spark-sql) {code:java} //spark-sql {code} ./spark-sql > create table test1 (name char(10), age int); > insert into test1 values('erya',15); > select * from test1 where name = 'erya'; // no results > select * from test1 where name = 'erya '; // add 6 spaces was: spark-sql can't query hive table values that with schema Char by equality. When i use spark-sql CLI to execute a query with hive tables,The expected results can not be obtained. The query result is empty. Some equality conditions did not work as expected.I checked and found that the table fields had one thing in common: they were created as char,sometimes as varchar.Then I execute the following statement and return an empty result: select * from foo where bar = 'something'.In fact, the data does exist. Using hive sql returns the correct results. > spark-sql can't query hive table values that with schema Char by equality. 
> -- > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. > Simulation steps:(use spark-sql) > {code:java} > //spark-sql > {code} > ./spark-sql > > create table test1 (name char(10), age int); > > insert into test1 values('erya',15); > > select * from test1 where name = 'erya'; // no results > > select * from test1 where name = 'erya '; // add 6 spaces -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961830#comment-16961830 ] Zhaoyang Qin commented on SPARK-29634: -- Hive doc says: {panel:title=Char} Char types are similar to Varchar but they are fixed-length meaning that values shorter than the specified length value are padded with spaces but trailing spaces are not important during comparisons. The maximum length is fixed at 255. {panel} > spark-sql can't query hive table values that with schema Char by equality. > -- > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961828#comment-16961828 ] Zhaoyang Qin commented on SPARK-29634: -- About Char datatype,Hive Docs sees: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-CharcharChar] But spark-sql CLI read this table by *select * from foo where bar = 'something'* ,There will be no results.You must add Spaces like this: *select * from foo where bar = 'something '.* This causes normal SQL fail to execute correctly. And many TPCDS statements to get no results.eg: sql7,sql71,sql 72 etc. > spark-sql can't query hive table values that with schema Char by equality. > -- > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29635) Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink
Jungtaek Lim created SPARK-29635: Summary: Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink Key: SPARK-29635 URL: https://issues.apache.org/jira/browse/SPARK-29635 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim There's a comment in KafkaContinuousSinkSuite which is most likely explaining TODO: https://github.com/apache/spark/blob/37690dea107623ebca1e47c64db59196ee388f2f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala#L35-L39 {noformat} /** * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 memory stream. * Once we have one, this will be changed to a specialization of KafkaSinkSuite and we won't have * to duplicate all the code. */ {noformat} Given latest master branch has V2 memory stream now, it is a good time to deduplicate two suites into one, via having base class and let these suites override necessary things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
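The deduplication the issue proposes, a shared base suite with per-sink overrides, can be sketched in miniature (Python stand-in for the Scala suites; class and method names here are hypothetical):

```python
# Sketch of the base-class approach: shared test logic lives in one
# place, and each sink suite overrides only what differs between the
# micro-batch and continuous execution modes.

class KafkaSinkSuiteBase:
    mode = None                      # overridden by concrete suites

    def run_round_trip_test(self):
        # Shared body: in the real suites this would write through the
        # Kafka sink and read the data back; here it just reports which
        # mode exercised the shared logic.
        return f"round-trip ok ({self.mode})"

class MicroBatchSinkSuite(KafkaSinkSuiteBase):
    mode = "micro-batch"

class ContinuousSinkSuite(KafkaSinkSuiteBase):
    mode = "continuous"

assert MicroBatchSinkSuite().run_round_trip_test() == "round-trip ok (micro-batch)"
assert ContinuousSinkSuite().run_round_trip_test() == "round-trip ok (continuous)"
```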
[jira] [Created] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
Zhaoyang Qin created SPARK-29634: Summary: spark-sql can't query hive table values that with schema Char by equality. Key: SPARK-29634 URL: https://issues.apache.org/jira/browse/SPARK-29634 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.3.1, 2.2.1 Environment: Spark 2.3.1 Hive 3.0.0 TPCDS Data & Tables Reporter: Zhaoyang Qin spark-sql can't query hive table values that with schema Char by equality. When i use spark-sql CLI to execute a query with hive tables,The expected results can not be obtained. The query result is empty. Some equality conditions did not work as expected.I checked and found that the table fields had one thing in common: they were created as char,sometimes as varchar.Then I execute the following statement and return an empty result: select * from foo where bar = 'something'.In fact, the data does exist. Using hive sql returns the correct results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961797#comment-16961797 ] Aman Omer commented on SPARK-29630: --- cc [~srowen] [~LI,Xiao] [~cloud_fan] > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961785#comment-16961785 ] Takeshi Yamamuro edited comment on SPARK-29630 at 10/29/19 8:19 AM: This is a logical view, so I think the view definition holds a logical plan depending on `temp_table`. The execution of the query might work well because of some optimization though. was (Author: maropu): This is a logical view, so I think the view definition holds a logical plan depending on `temp_table`. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961785#comment-16961785 ] Takeshi Yamamuro edited comment on SPARK-29630 at 10/29/19 8:18 AM: This is a logical view, so I think the view definition holds a logical plan depending on `temp_table`. was (Author: maropu): Ah, I see. I'll recheck later. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer edited comment on SPARK-29595 at 10/29/19 8:18 AM: - {color:#172b4d}named_struct takes Seq(name1, val1, name2, val2, ...) as its parameters. The validation step for named_struct only checks for a string at the odd positions. For example, the following query will add a row to the _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling a similar issue, reordering the fields of a struct type according to their names would introduce complexity. So I think Spark should throw an exception when the names do not match in named_struct. cc [~srowen] [~maropu] was (Author: aman_omer): {color:#172b4d}Parameter required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. 
cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer edited comment on SPARK-29595 at 10/29/19 8:17 AM: - {color:#172b4d}Parameter required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. cc [~srowen] [~maropu] was (Author: aman_omer): {color:#172b4d}Parameters required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color}{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. 
cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29630: - Parent: SPARK-27764 Issue Type: Sub-task (was: Improvement) > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer edited comment on SPARK-29595 at 10/29/19 8:13 AM: - {color:#172b4d}Parameters required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color}{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. cc [~srowen] [~maropu] was (Author: aman_omer): Parameters required for named_struct is _{color:#172b4d}Seq(name1, val1, name2, val2, ...){color}_{color:#808080}{color:#172b4d}. Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color} {color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. 
cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961785#comment-16961785 ] Takeshi Yamamuro commented on SPARK-29630: -- Ah, I see. I'll recheck later. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer commented on SPARK-29595: --- {color:#172b4d}named_struct takes Seq(name1, val1, name2, val2, ...) as its parameters. The validation step for named_struct only checks for a string at the odd positions. For example, the following query will add a row to the _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling a similar issue, reordering the fields of a struct type according to their names would introduce complexity. So I think Spark should throw an exception when the names do not match in named_struct. cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
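The by-name matching being proposed can be sketched in plain Python. This is an illustration only, not Spark's analyzer code; `align_named_struct` and its signature are hypothetical:

```python
# Hypothetical sketch (plain Python, not Spark internals): reorder a
# named_struct's (name, value) pairs to match the target schema's field
# order, and fail loudly when the names do not match the schema at all.
def align_named_struct(pairs, target_fields):
    """pairs: [(name1, val1), (name2, val2), ...] as in named_struct;
    target_fields: field names of the destination struct column."""
    by_name = dict(pairs)
    if set(by_name) != set(target_fields):
        # Mirrors the proposal above: throw instead of silently
        # matching by position when names disagree with the schema.
        raise ValueError(f"field names {sorted(by_name)} do not match "
                         f"target schema {sorted(target_fields)}")
    return [by_name[f] for f in target_fields]

# named_struct("b", 3, "a", 1) inserted into struct<a, b> should give (1, 3)
print(align_named_struct([("b", 3), ("a", 1)], ["a", "b"]))  # [1, 3]
```

With this behavior, the `named_struct("ab", 1, "ba", 2)` example above would raise rather than insert a mismatched row.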
[jira] [Commented] (SPARK-29626) notEqual() should return true when the one is null, the other is not null
[ https://issues.apache.org/jira/browse/SPARK-29626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961770#comment-16961770 ] Aman Omer commented on SPARK-29626: --- Looking into this one. > notEqual() should return true when the one is null, the other is not null > - > > Key: SPARK-29626 > URL: https://issues.apache.org/jira/browse/SPARK-29626 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.4 >Reporter: zhouhuazheng >Priority: Minor > > When one value is null and the other is not null, we expect the function > notEqual() to return true. > e.g.: > scala> df.show() > +--+---+ > | age| name| > +--+---+ > | null|Michael| > | 30| Andy| > | 19| Justin| > | 35| null| > | 19| Justin| > | null| null| > |Justin| Justin| > | 19| 19| > +--+---+ > scala> df.filter(col("age").notEqual(col("name"))).show > +---+--+ > |age| name| > +---+--+ > | 30| Andy| > | 19|Justin| > | 19|Justin| > +---+--+ > scala> df.filter(col("age").equalTo(col("name"))).show > +--+--+ > | age| name| > +--+--+ > | null| null| > |Justin|Justin| > | 19| 19| > +--+--+ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
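For context, the behavior reported above is standard SQL three-valued logic: when either operand is NULL, both `a = b` and `a <> b` evaluate to NULL, and a filter keeps only rows where the predicate is true, so rows with nulls are dropped by both filters. Spark's null-safe comparison is `<=>` (`Column.eqNullSafe`). A plain-Python sketch of these semantics (an illustration, not Spark's implementation; `None` stands in for SQL NULL):

```python
# Illustrative sketch of SQL three-valued comparison logic.
def sql_eq(a, b):
    return None if a is None or b is None else a == b

def sql_not_equal(a, b):
    eq = sql_eq(a, b)
    return None if eq is None else not eq

def eq_null_safe(a, b):
    # Like Spark's `<=>` / Column.eqNullSafe: NULL <=> NULL is true.
    return a == b

rows = [(None, "Michael"), (30, "Andy"), ("Justin", "Justin"), (None, None)]
# A WHERE clause keeps only rows where the predicate is true (not NULL).
kept = [r for r in rows if sql_not_equal(*r) is True]
print(kept)  # [(30, 'Andy')]
```

This is why `notEqual` drops the `(null, Michael)` row: the predicate evaluates to NULL, not false, and only true rows pass the filter.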
[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
[ https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961766#comment-16961766 ] huangtianhua commented on SPARK-29222: -- [~hyukjin.kwon], the failure is related to the performance of the ARM instance. We have donated an ARM instance to AMPLab but haven't built the Python job yet, and we plan to donate some higher-performance ARM instances to AMPLab next month. If the issue happens again, I will create a PR to fix it. Thanks a lot. > Flaky test: > pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence > --- > > Key: SPARK-29222 > URL: https://issues.apache.org/jira/browse/SPARK-29222 > Project: Spark > Issue Type: Test > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/] > {code:java} > Error Message > 7 != 10 > StacktraceTraceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 429, in test_parameter_convergence > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 425, in condition > self.assertEqual(len(model_weights), len(batches)) > AssertionError: 7 != 10 >{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
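The `_eventually` helper in the stack trace retries a condition until a timeout, optionally swallowing assertion errors until the deadline; on a slow instance the stream delivers fewer batches than expected before time runs out, producing failures like `7 != 10`. A minimal sketch of such a retry helper (assumed to mirror the general pattern only, not the actual PySpark test utility):

```python
import time

# Minimal sketch of an "eventually"-style retry helper like the
# _eventually seen in the failing test (not the PySpark source).
def eventually(condition, timeout=5.0, interval=0.05, catch_assertions=False):
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition() is not False:
                return  # condition passed (returned truthy or None)
        except AssertionError as e:
            if not catch_assertions:
                raise
            last_error = e  # remember, re-raise only after timeout
        time.sleep(interval)
    if last_error is not None:
        raise last_error
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Usage: succeeds once a slow producer has delivered enough batches.
batches = []
def condition():
    batches.append(object())      # stand-in for "another batch arrived"
    assert len(batches) >= 3, f"{len(batches)} != 3"
eventually(condition, catch_assertions=True)
```

Under this pattern, raising the timeout or lowering the expected batch count is the usual fix for slow test hosts.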
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961765#comment-16961765 ] Aman Omer commented on SPARK-29630: --- [~maropu] I think when we project only a constant in the subquery, Spark does not link the subquery's table/view. Hence the view _v8_temp_ is not dependent on _temp_table_. Please check the output for the following query. $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM temp_table); > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
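If that is the cause, the dependency check presumably walks the view's logical plan collecting referenced relations but does not descend into subquery expressions such as EXISTS. A toy sketch of the difference (plain Python; `Relation`, `Filter`, and `collect_relations` are hypothetical stand-ins, not Spark's analyzer classes):

```python
# Toy plan nodes: a leaf relation, and a filter whose condition may
# hold subquery expressions (EXISTS, IN, scalar subqueries).
class Relation:
    def __init__(self, name):
        self.name = name

class Filter:
    def __init__(self, child, subqueries=()):
        self.child, self.subqueries = child, subqueries

def collect_relations(plan, include_subqueries=True):
    """Collect every relation name the plan references. Skipping the
    subquery expressions is the hypothesized bug: a temp view used
    only inside EXISTS then goes unnoticed."""
    if isinstance(plan, Relation):
        return {plan.name}
    names = collect_relations(plan.child, include_subqueries)
    if include_subqueries:
        for sub in plan.subqueries:
            names |= collect_relations(sub, include_subqueries)
    return names

# SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM temp_table)
plan = Filter(Relation("base_table"), subqueries=(Relation("temp_table"),))
print(collect_relations(plan, include_subqueries=False))  # {'base_table'}
print(sorted(collect_relations(plan)))  # ['base_table', 'temp_table']
```

Only the traversal that descends into subquery expressions notices `temp_table`, which would explain why the EXISTS case slips past the permanent-view check.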
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961760#comment-16961760 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp], hi, the ARM job has been running for some days; although there have been some failures due to the poor performance of the ARM instance or the network upgrade, the good news is that we got several successes. At least it means that Spark is supported on the ARM platform :) According to the discussion 'Deprecate Python < 3.6 in Spark 3.0', I think we can now run the Python tests only for Python 3.6, which would be simpler for us. So how about adding a new ARM job for the Python 3.6 tests? > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test on other stable branches too, and we can integrate them to > amplab when they are ready. 
> We have offered an arm instance and sent the infos to shane knapp, thanks > shane for adding the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like > 'property'/'profile' to choose the correct jar package on the arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it for use when testing spark on arm. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29628) Forcibly create a temporary view in CREATE VIEW if referencing a temporary view
[ https://issues.apache.org/jira/browse/SPARK-29628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961758#comment-16961758 ] Aman Omer commented on SPARK-29628: --- Thanks [~maropu] . I will look into this one. > Forcibly create a temporary view in CREATE VIEW if referencing a temporary > view > --- > > Key: SPARK-29628 > URL: https://issues.apache.org/jira/browse/SPARK-29628 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > {code} > CREATE TEMPORARY VIEW temp_table AS SELECT * FROM VALUES > (1, 1) as temp_table(a, id); > CREATE VIEW v1_temp AS SELECT * FROM temp_table; > // In Spark > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v1_temp` by referencing a temporary > view `temp_table`; > // In PostgreSQL > NOTICE: view "v1_temp" will be a temporary view > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29633) Make COLUMN optional in ALTER TABLE
Takeshi Yamamuro created SPARK-29633: Summary: Make COLUMN optional in ALTER TABLE Key: SPARK-29633 URL: https://issues.apache.org/jira/browse/SPARK-29633 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro In the master, the query below fails; {code} create table tt3 (ax bigint, b short, c decimal) using parquet; alter table tt3 rename c to d; {code} In PostgreSQL, COLUMN is optional; {code} ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ] RENAME [ COLUMN ] column_name TO new_column_name {code} https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
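A toy illustration of what "making COLUMN optional" means for the statement's grammar. This is a regex sketch in plain Python, not Spark's ANTLR grammar; `parse_rename` is hypothetical:

```python
import re

# Toy sketch of the grammar change: the COLUMN keyword becomes
# optional in ALTER TABLE ... RENAME, matching PostgreSQL's
# RENAME [ COLUMN ] column_name TO new_column_name.
RENAME_RE = re.compile(
    r"ALTER\s+TABLE\s+(?P<table>\w+)\s+RENAME\s+(?:COLUMN\s+)?"
    r"(?P<old>\w+)\s+TO\s+(?P<new>\w+)\s*;?$",
    re.IGNORECASE,
)

def parse_rename(sql):
    m = RENAME_RE.match(sql.strip())
    return m.groupdict() if m else None

# Both spellings parse to the same result once COLUMN is optional.
print(parse_rename("alter table tt3 rename c to d"))
print(parse_rename("ALTER TABLE tt3 RENAME COLUMN c TO d;"))
```

In Spark itself the equivalent change would live in the SqlBase grammar, where the `COLUMN` token would be marked optional in the rename rule.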
[jira] [Updated] (SPARK-29632) Support ALTER TABLE [relname] SET SCHEMA [dbname]
[ https://issues.apache.org/jira/browse/SPARK-29632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29632: - Description: {code} CREATE SCHEMA temp_view_test; CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; ALTER TABLE tx1 SET SCHEMA temp_view_test; {code} {code} ALTER TABLE [ IF EXISTS ] name SET SCHEMA new_schema {code} https://www.postgresql.org/docs/current/sql-altertable.html was: {code} CREATE SCHEMA temp_view_test; CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; ALTER TABLE tx1 SET SCHEMA temp_view_test; {code} https://www.postgresql.org/docs/current/sql-altertable.html > Support ALTER TABLE [relname] SET SCHEMA [dbname] > - > > Key: SPARK-29632 > URL: https://issues.apache.org/jira/browse/SPARK-29632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > CREATE SCHEMA temp_view_test; > CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; > ALTER TABLE tx1 SET SCHEMA temp_view_test; > {code} > {code} > ALTER TABLE [ IF EXISTS ] name > SET SCHEMA new_schema > {code} > https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29632) Support ALTER TABLE [relname] SET SCHEMA [dbname]
[ https://issues.apache.org/jira/browse/SPARK-29632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29632: - Parent: SPARK-27764 Issue Type: Sub-task (was: Improvement) > Support ALTER TABLE [relname] SET SCHEMA [dbname] > - > > Key: SPARK-29632 > URL: https://issues.apache.org/jira/browse/SPARK-29632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > CREATE SCHEMA temp_view_test; > CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; > ALTER TABLE tx1 SET SCHEMA temp_view_test; > {code} > https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29632) Support ALTER TABLE [relname] SET SCHEMA [dbname]
Takeshi Yamamuro created SPARK-29632: Summary: Support ALTER TABLE [relname] SET SCHEMA [dbname] Key: SPARK-29632 URL: https://issues.apache.org/jira/browse/SPARK-29632 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code} CREATE SCHEMA temp_view_test; CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; ALTER TABLE tx1 SET SCHEMA temp_view_test; {code} https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
[ https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961723#comment-16961723 ] jobit mathew commented on SPARK-29476: -- [~hyukjin.kwon], I think it is better to have tooltips in the Executors tab, especially for the Thread Dump link (a tooltip is already added for most of the other columns), to explain what it means, e.g. *thread dump for executors and drivers*. After clicking the Thread Dump link, the next page contains the *search* box and the *thread details table*. On this page we should also add a *tooltip* for *Search*, mentioning everything it will search, i.e. the content of the table including the stack trace details, and a *tooltip* for the *thread table column headings*, for better understanding. > Add tooltip information for Thread Dump links and Thread details table > columns in Executors Tab > --- > > Key: SPARK-29476 > URL: https://issues.apache.org/jira/browse/SPARK-29476 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29565: --- Assignee: Huaxin Gao > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > > Current feature algs > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-col & multi-col. > And there are already some internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it's reasonable to support single-col. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29565. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26265 [https://github.com/apache/spark/pull/26265] > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Current feature algs > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-col & multi-col. > And there are already some internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it's reasonable to support single-col. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29631) Support ANSI SQL CREATE SEQUENCE
Takeshi Yamamuro created SPARK-29631: Summary: Support ANSI SQL CREATE SEQUENCE Key: SPARK-29631 URL: https://issues.apache.org/jira/browse/SPARK-29631 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code} CREATE SEQUENCE seq1; CREATE TEMPORARY SEQUENCE seq1_temp; CREATE VIEW v9 AS SELECT seq1.is_called FROM seq1; CREATE VIEW v13_temp AS SELECT seq1_temp.is_called FROM seq1_temp; {code} https://www.postgresql.org/docs/current/sql-createsequence.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
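A minimal sketch of the sequence semantics being requested (plain Python, not Spark or PostgreSQL code): `nextval` advances and returns the value, `currval` is only defined after the first `nextval`, and the `is_called` flag referenced by the example views above is mimicked by a boolean.

```python
# Minimal sketch of ANSI-style sequence semantics.
class Sequence:
    def __init__(self, start=1, increment=1):
        self._next = start
        self._increment = increment
        self._current = None
        self.is_called = False   # mirrors PostgreSQL's is_called flag

    def nextval(self):
        self._current = self._next
        self._next += self._increment
        self.is_called = True
        return self._current

    def currval(self):
        if not self.is_called:
            raise RuntimeError("currval called before nextval")
        return self._current

seq1 = Sequence()
print(seq1.nextval(), seq1.nextval(), seq1.currval())  # 1 2 2
```

A TEMPORARY sequence would differ only in lifetime (dropped at session end), which is what makes the `v13_temp` view in the example a temporary-object dependency.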
[jira] [Comment Edited] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961672#comment-16961672 ] Brandon edited comment on SPARK-24918 at 10/29/19 6:09 AM: --- [~nsheth] placing the plugin class inside a jar and passing as `--jars` to spark-submit should be sufficient, right? It seems this is not enough to make the class visible to the executor. I have had to explicitly add this jar to `spark.executor.extraClassPath` for plugins to load correctly. was (Author: brandonvin): [~nsheth] placing the plugin class inside a jar and passing as `--jars` to spark-submit should sufficient, right? It seems this is not enough to make the class visible to the executor. I have had to explicitly add this jar to `spark.executor.extraClassPath` for plugins to load correctly. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Nihar Sheth >Priority: Major > Labels: SPIP, memory-analysis > Fix For: 2.4.0 > > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). 
> I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
Takeshi Yamamuro created SPARK-29630: Summary: Not allowed to create a permanent view by referencing a temporary view in EXISTS Key: SPARK-29630 URL: https://issues.apache.org/jira/browse/SPARK-29630 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code} // In the master, the query below fails $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * FROM temp_table) t2; org.apache.spark.sql.AnalysisException Not allowed to create a permanent view `v7_temp` by referencing a temporary view `temp_table`; // In the master, the query below passed, but this should fail $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM temp_table); Passed {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org