[jira] [Commented] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet
[ https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192667#comment-17192667 ] Apache Spark commented on SPARK-32755: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29689
> Maintain the order of expressions in AttributeSet and ExpressionSet
>
> Key: SPARK-32755
> URL: https://issues.apache.org/jira/browse/SPARK-32755
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Ali Afroozeh
> Assignee: Ali Afroozeh
> Priority: Major
> Fix For: 3.1.0
>
> Expression identity is based on the ExprId, which is an auto-incremented number. This means that the same query can yield query plans with different expression ids in different runs. AttributeSet and ExpressionSet internally use a HashSet as the underlying data structure and therefore cannot guarantee a fixed order of elements across runs. This is problematic when we want to check for plan changes between runs.
> We make the following changes to AttributeSet and ExpressionSet so that they maintain the insertion order of their elements:
> * We change the underlying data structure of AttributeSet from HashSet to LinkedHashSet to maintain the insertion order.
> * ExpressionSet already uses a list to keep track of the expressions. However, since it extends Scala's immutable.Set class, operations such as map and flatMap are delegated to immutable.Set itself, so the result of these operations is no longer an instance of ExpressionSet but rather an implementation picked by the parent class. We remove the inheritance from immutable.Set and implement the needed methods directly. ExpressionSet has very specific semantics, and it does not make sense to extend immutable.Set anyway.
> * We change the PlanStabilitySuite to not sort the attributes, so that it can catch changes in the order of expressions in different runs.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet
[ https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192665#comment-17192665 ] Apache Spark commented on SPARK-32755: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29689
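The HashSet-vs-LinkedHashSet distinction above can be illustrated with plain Scala collections (a minimal sketch, not Spark code; the element values are hypothetical stand-ins for attributes with ExprIds):

{code:scala}
import scala.collection.mutable

// HashSet iterates in hash order, which depends on the elements' hash
// codes -- for expressions keyed by auto-incremented ExprIds, that order
// can differ between runs of the same query.
// LinkedHashSet additionally maintains insertion order, so iteration is
// deterministic regardless of the hash codes.
object InsertionOrderDemo extends App {
  val attrs = Seq("c#3", "a#1", "b#2") // hypothetical attribute names

  val linked = mutable.LinkedHashSet.empty[String]
  attrs.foreach(linked += _)

  // Iteration order equals insertion order:
  assert(linked.toSeq == attrs)
  println(linked.mkString(", ")) // c#3, a#1, b#2
}
{code}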
[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192628#comment-17192628 ] Hyukjin Kwon commented on SPARK-32810: -- Thanks [~dongjoon] for fixing it here and in other JIRAs.
> CSV/JSON data sources should avoid globbing paths when inferring schema
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
> The problem is that when the user doesn't specify the schema when reading a CSV table, the CSV file format and data source need to infer the schema, and they do so by creating a base DataSource relation. There is a mismatch: *FileFormat.inferSchema* expects actual file paths without glob patterns, but *DataSource.paths* expects file paths as glob patterns.
> An example is demonstrated below:
> {code:java}
>                                   ^
> | DataSource.resolveRelation      tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply                ^
> | CSVDataSource.inferSchema       |
> | CSVFileFormat.inferSchema       |
> | ...                             |
> | DataSource.resolveRelation      globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> | DataSource.apply                ^
> | DataFrameReader.load            |
> | input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
[jira] [Updated] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32810: -- Affects Version/s: 2.4.6 3.0.0 3.0.1
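The failure mode can be sketched with a file whose name contains glob metacharacters (a hedged sketch; the file name is taken from the call-stack example in the issue and is assumed to exist on disk):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the double-globbing problem: a concrete file named
// "[abc].csv" is a legitimate verbatim path, but "[abc]" is also a glob
// character class matching "a", "b" or "c". If schema inference re-globs
// the already-resolved path, it looks for "a.csv"/"b.csv"/"c.csv"
// instead of the actual file and fails to find it.
object GlobDemo extends App {
  val spark = SparkSession.builder()
    .appName("glob-demo").master("local[*]").getOrCreate()

  // User-facing paths may be glob patterns, so to pass "[abc].csv"
  // verbatim the glob characters must be escaped, as in the issue's
  // example input -- but the escaping is lost once the path has been
  // resolved and is then (incorrectly) globbed a second time:
  val df = spark.read.csv("""\[abc\].csv""")
  df.printSchema()
}
{code}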
[jira] [Commented] (SPARK-32827) Add spark.sql.maxMetadataStringLength config
[ https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192621#comment-17192621 ] Apache Spark commented on SPARK-32827: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/29688
> Add spark.sql.maxMetadataStringLength config
>
> Key: SPARK-32827
> URL: https://issues.apache.org/jira/browse/SPARK-32827
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: ulysses you
> Priority: Minor
>
> Add a new config `spark.sql.maxMetadataStringLength`. This config aims to limit the metadata value length, e.g. the file location.
> We found that metadata is abbreviated with `...` when trying to add some tests in `SQLQueryTestSuite`. Because of that, we can't replace the location value with `className`, since the `className` itself has been abbreviated.
[jira] [Commented] (SPARK-32827) Add spark.sql.maxMetadataStringLength config
[ https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192620#comment-17192620 ] Apache Spark commented on SPARK-32827: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/29688
[jira] [Assigned] (SPARK-32827) Add spark.sql.maxMetadataStringLength config
[ https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32827: Assignee: Apache Spark
[jira] [Assigned] (SPARK-32827) Add spark.sql.maxMetadataStringLength config
[ https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32827: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-32827) Add spark.sql.maxMetadataStringLength config
ulysses you created SPARK-32827: --- Summary: Add spark.sql.maxMetadataStringLength config Key: SPARK-32827 URL: https://issues.apache.org/jira/browse/SPARK-32827 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you
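A sketch of how the proposed config would be used (the config name comes from this issue; the limit value and exact truncation behavior are assumptions pending the PR):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch of the config proposed in this issue:
// raising the limit keeps long metadata values (e.g. file locations in
// explain() output) from being abbreviated with "...".
object MetadataLengthDemo extends App {
  val spark = SparkSession.builder()
    .appName("metadata-demo")
    .master("local[*]")
    .config("spark.sql.maxMetadataStringLength", "500") // assumed value
    .getOrCreate()

  val path = "/tmp/some/very/long/path/to/a/parquet/table" // hypothetical path
  spark.range(10).write.mode("overwrite").parquet(path)

  // The scan node's Location metadata would now be shown up to the
  // configured length instead of being truncated early:
  spark.read.parquet(path).explain()
}
{code}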
[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192607#comment-17192607 ] Hyukjin Kwon commented on SPARK-32187: -- Hey [~fhoering] are you back now :-)?
> User Guide - Shipping Python Package
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Assignee: Fabian Höring
> Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)
[jira] [Commented] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192603#comment-17192603 ] Apache Spark commented on SPARK-32826: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29687
> Add test case for get null columns using SparkGetColumnsOperation
>
> Key: SPARK-32826
> URL: https://issues.apache.org/jira/browse/SPARK-32826
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kent Yao
> Priority: Minor
>
> In Spark 3.0.0, SparkGetColumnsOperation could not recognize NULL columns, but now it can, as a side effect of https://issues.apache.org/jira/browse/SPARK-32696
[jira] [Assigned] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32826: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192602#comment-17192602 ] Apache Spark commented on SPARK-32826: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29687
[jira] [Assigned] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32826: Assignee: Apache Spark
[jira] [Created] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation
Kent Yao created SPARK-32826: Summary: Add test case for get null columns using SparkGetColumnsOperation Key: SPARK-32826 URL: https://issues.apache.org/jira/browse/SPARK-32826 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao
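For context, a NULL-typed column (the case the new test covers) arises from an untyped null literal. A minimal sketch, assuming a running SparkSession named `spark`:

{code:scala}
// Minimal sketch (assumes an existing SparkSession `spark`): a column
// produced from a bare null literal has NullType, which is the case
// SparkGetColumnsOperation previously failed to report over JDBC/Thrift.
val df = spark.sql("SELECT null AS null_col, 1 AS int_col")

// The schema records NullType for the first column:
df.schema.foreach(f => println(s"${f.name}: ${f.dataType.simpleString}"))
{code}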
[jira] [Updated] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32187: - Description: - Zipped file - Python files - Virtualenv with Yarn - PEX \(?\) (see also SPARK-25433) was: - Zipped file - Python files - PEX \(?\) (see also SPARK-25433)
[jira] [Resolved] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32813. -- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29667 [https://github.com/apache/spark/pull/29667]
> Reading parquet rdd in non columnar mode fails in multithreaded environment
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Environment: Spark 3.0.0, Scala 2.12.12
> Reporter: Vladimir Klyushnikov
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.1.0, 3.0.2
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if the Spark session was created in one thread and the RDD is read in another, because the InheritableThreadLocal holding the active session is not propagated. The code below worked perfectly in Spark 2.x, but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
>     val executor1 = Executors.newSingleThreadExecutor()
>     val executor2 = Executors.newSingleThreadExecutor()
>     try {
>       val ds = Await.result(Future {
>         val session = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
>         import session.implicits._
>         val path = "test.parquet"
>         session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
>         session.read.parquet(path).as[Data]
>       }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>       Await.result(Future {
>         ds.rdd.collect().foreach(println(_))
>       }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
>     } finally {
>       executor1.shutdown()
>       executor2.shutdown()
>     }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
>  at scala.None$.get(Option.scala:529)
>  at scala.None$.get(Option.scala:527)
>  at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>  at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>  at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>  at org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
>  at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198)
>  at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}
[jira] [Assigned] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32813: Assignee: L. C. Hsieh
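A possible workaround sketch for the reproducer (not the committed fix from the PR above): re-register the session explicitly in the consuming thread via the public `SparkSession.setActiveSession`/`setDefaultSession` API, since the InheritableThreadLocal is only copied to threads created after the session exists. Assumes the `session`, `ds` and `executor2` values from the reproducer:

{code:scala}
// Workaround sketch: make the session visible on the second executor's
// thread before touching ds.rdd, so FileSourceScanExec can find an
// active session instead of hitting None.get.
Await.result(Future {
  SparkSession.setActiveSession(session)
  SparkSession.setDefaultSession(session)
  ds.rdd.collect().foreach(println(_))
}(ExecutionContext.fromExecutorService(executor2)), 1.minute)
{code}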
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192574#comment-17192574 ] Jungtaek Lim commented on SPARK-24295: -- [~sta...@gmail.com] Thanks for sharing the workaround. I've proposed applying a TTL on the FileStreamSink output, which does something similar to your workaround, but purges on every compact batch. Unfortunately it hasn't attracted enough interest from committers, though. SPARK-27188 ([https://github.com/apache/spark/pull/28363])
> Purge Structured streaming FileStreamSinkLog metadata compact file data.
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Iqbal Singh
> Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing slowness when reading the data from the FileStreamSinkLog dir, as Spark defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
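For reference, the compact interval itself is already configurable (an existing option, not the purge/TTL mechanism discussed here); a larger interval slows the growth of the compact file but does not bound its size. A sketch:

{code:scala}
import org.apache.spark.sql.SparkSession

// Existing knob: how many batches between compactions of the file sink
// metadata log (default 10). This mitigates but does not solve the
// unbounded-growth problem this issue is about.
val spark = SparkSession.builder()
  .appName("sink-log-demo")
  .master("local[*]")
  .config("spark.sql.streaming.fileSink.log.compactInterval", "100")
  .getOrCreate()
{code}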
[jira] [Commented] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192573#comment-17192573 ] Johnny Bai commented on SPARK-32821: -- Let's have a discussion about watermark-with-window grammar in the SQL statement; I would like to implement it.
> cannot group by with window in sql statement for structured streaming with watermark
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
> Reporter: Johnny Bai
> Priority: Major
>
> Currently only the DSL style is supported, as below:
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word").count()
>
> but grouping by a window is not supported in SQL style, as below:
> "select ts_field, count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"
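The supported DSL form, fleshed out with a watermark (this is roughly what the proposed SQL syntax would desugar to; the `rate` source here is a stand-in for the elided streaming input):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder()
  .appName("watermark-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in streaming source: the built-in "rate" source yields
// (timestamp: Timestamp, value: Long); derive a word column from it.
val words = spark.readStream
  .format("rate").option("rowsPerSecond", "5").load()
  .selectExpr("timestamp", "concat('w', CAST(value % 3 AS STRING)) AS word")

// DSL equivalent of the proposed "with watermark" SQL clause:
val windowedCounts = words
  .withWatermark("timestamp", "1 minute") // late-data threshold
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()
{code}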
[jira] [Updated] (SPARK-32788) non-partitioned table scan should not have partition filter
[ https://issues.apache.org/jira/browse/SPARK-32788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32788: -- Affects Version/s: (was: 3.0.0) 3.0.1 > non-partitioned table scan should not have partition filter > --- > > Key: SPARK-32788 > URL: https://issues.apache.org/jira/browse/SPARK-32788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0, 3.0.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32788) non-partitioned table scan should not have partition filter
[ https://issues.apache.org/jira/browse/SPARK-32788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32788: -- Affects Version/s: 3.0.0
[jira] [Commented] (SPARK-27089) Loss of precision during decimal division
[ https://issues.apache.org/jira/browse/SPARK-27089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192566#comment-17192566 ] Daeho Ro commented on SPARK-27089: -- It seems that the bug persists on Spark version 3.0.0 > Loss of precision during decimal division > - > > Key: SPARK-27089 > URL: https://issues.apache.org/jira/browse/SPARK-27089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: ylo0ztlmtusq >Priority: Major > > Spark loses decimal places when dividing decimal numbers. > > Expected behavior (In Spark 2.2.3 or before) > > {code:java} > scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val""" > sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val > scala> spark.sql(sql).show > 19/03/07 21:23:51 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > +----------------+ > |             val| > +----------------+ > |0.33333333333333| > +----------------+ > {code} > > Current behavior (In Spark 2.3.2 and later) > > {code:java} > scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val""" > sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as > decimal(38,14)) as decimal(38,14)) val > scala> spark.sql(sql).show > +----------------+ > |             val| > +----------------+ > |0.33333300000000| > +----------------+ > {code} > > Seems to be caused by {{promote_precision(38, 6)}} > > {code:java} > scala> spark.sql(sql).explain(true) > == Parsed Logical Plan == > Project [cast((cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))) as > decimal(38,14)) AS val#20] > +- OneRowRelation > == Analyzed Logical Plan == > val: decimal(38,14) > Project [cast(CheckOverflow((promote_precision(cast(cast(3 as decimal(38,14)) > as decimal(38,14))) / promote_precision(cast(cast(9 as decimal(38,14)) as > decimal(38,14)))), DecimalType(38,6)) as decimal(38,14)) AS val#20] > +- OneRowRelation > == Optimized Logical Plan == > Project 
[0.33333300000000 AS val#20] > +- OneRowRelation > == Physical Plan == > *(1) Project [0.33333300000000 AS val#20] > +- Scan OneRowRelation[] > {code} > > Source https://stackoverflow.com/q/55046492 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
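The behavior above can be mimicked outside Spark with plain fixed-point arithmetic: the analyzed plan computes the division at DecimalType(38,6), i.e. scale 6, before casting back to decimal(38,14), so the last eight fractional digits are already gone. A sketch using Python's decimal module (illustrative; not Spark's code path):

```python
from decimal import Decimal, ROUND_HALF_UP

def div_at_scale(a, b, scale):
    """Divide and round the quotient to `scale` fractional digits."""
    quantum = Decimal(1).scaleb(-scale)  # scale=6 -> Decimal('0.000001')
    return (a / b).quantize(quantum, rounding=ROUND_HALF_UP)

a, b = Decimal(3), Decimal(9)
# Division carried out at scale 14, as in Spark 2.2.3 and earlier:
full = div_at_scale(a, b, 14)
# Intermediate result forced to scale 6, then cast back to scale 14,
# mirroring the DecimalType(38,6) step visible in the analyzed plan:
lossy = div_at_scale(a, b, 6).quantize(Decimal(1).scaleb(-14))
```

The widened cast back to scale 14 cannot recover the digits the scale-6 intermediate discarded, which is the reported loss of precision.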
[jira] [Updated] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Summary: cannot group by with window in sql statement for structured streaming with watermark (was: cannot group by with window in sql sentence for structured streaming with watermark) > cannot group by with window in sql statement for structured streaming with > watermark > > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > > current only support dsl style as below: > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192563#comment-17192563 ] Johnny Bai edited comment on SPARK-32821 at 9/9/20, 1:45 AM: - [~kabhwan] As structured streaming evolves, I think it is necessary to build a relatively complete structured streaming SQL specification, like the ANSI SQL standard was (Author: johnny bai): [~kabhwan] as structured streaming going, I think it is necessary to build a relative complete structured streaming SQL standard specification like the ANSI SQL standard > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > > currently only the DSL style is supported, as below: > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but group by with window is not supported in SQL style, as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192563#comment-17192563 ] Johnny Bai commented on SPARK-32821: [~kabhwan] As structured streaming evolves, I think it is necessary to build a relatively complete structured streaming SQL specification, like the ANSI SQL standard > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > > currently only the DSL style is supported, as below: > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but group by with window is not supported in SQL style, as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32823) Standalone Master UI resources in use wrong
[ https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32823. -- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29683 [https://github.com/apache/spark/pull/29683] > Standalone Master UI resources in use wrong > --- > > Key: SPARK-32823 > URL: https://issues.apache.org/jira/browse/SPARK-32823 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > I was using the standalone deployment with workers with GPUs, and the Master > UI was wrong for: > * *Resources in use:* 0 / 4 gpu > In this case I had 2 workers, each with 4 gpus, so this total should have > been 8. It seems like it's just looking at a single worker. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
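The total reported by the Master should be the sum over all registered workers, not one worker's count. A minimal sketch of the intended aggregation (hypothetical data model; not the actual Master code):

```python
from collections import defaultdict

def resources_in_use(workers):
    """Aggregate (used, total) per resource name across all workers.

    `workers` is a list of dicts mapping resource name -> (used, total).
    """
    used = defaultdict(int)
    total = defaultdict(int)
    for worker in workers:
        for name, (u, t) in worker.items():
            used[name] += u
            total[name] += t
    return {name: (used[name], total[name]) for name in total}

# Two workers with 4 GPUs each: the UI should show "0 / 8 gpu", not "0 / 4 gpu".
summary = resources_in_use([{"gpu": (0, 4)}, {"gpu": (0, 4)}])
```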
[jira] [Assigned] (SPARK-32823) Standalone Master UI resources in use wrong
[ https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32823: Assignee: Thomas Graves > Standalone Master UI resources in use wrong > --- > > Key: SPARK-32823 > URL: https://issues.apache.org/jira/browse/SPARK-32823 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > I was using the standalone deployment with workers with GPUs and the master > ui was wrong for: > * *Resources in use:* 0 / 4 gpu > In this case I had 2 workers, each with 4 gpus, so this total should have > been 8. It seems like its just looking at a single worker. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32824) The error is confusing when resource .amount not provided
[ https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32824. -- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29685 [https://github.com/apache/spark/pull/29685] > The error is confusing when resource .amount not provided > -- > > Key: SPARK-32824 > URL: https://issues.apache.org/jira/browse/SPARK-32824 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > If the user forgets to specify the .amount when specifying a resource, the > error that comes out is confusing, we should improve. > > $ $SPARK_HOME/bin/spark-shell --master spark://host9:7077 --conf > spark.executor.resource.gpu=1 > > {code:java} > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.propertiesSetting default log level to > "WARN".To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use > setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error > initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String > index out of range: -1 at java.lang.String.substring(String.java:1967) at > org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150) > at > org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158) > at > org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773) > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:528) at > org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at > $line3.$read$$iw$$iw.<init>(<console>:15) at > $line3.$read$$iw.<init>(<console>:42) at > $line3.$read.<init>(<console>:44){code} > ' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
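The `StringIndexOutOfBoundsException: String index out of range: -1` in the trace is the classic result of substring-ing at the index of a '.' that was never found: with `spark.executor.resource.gpu=1` the key has no `.amount` suffix to split on. A hypothetical sketch of the failure mode and a friendlier check (illustrative Python; the real parsing lives in ResourceUtils.scala):

```python
def parse_resource_suffix(key):
    """Split a resource config key like 'gpu.amount' into (name, suffix).

    Naive code doing key[:key.index('.')] dies with an index error when the
    '.' is absent; validating first yields an actionable message instead.
    """
    name, sep, suffix = key.partition(".")
    if not sep or not suffix:
        raise ValueError(
            f"resource config {key!r} is missing a suffix such as '.amount'")
    return name, suffix

ok = parse_resource_suffix("gpu.amount")
```

The improvement the ticket asks for amounts to replacing the raw index failure with an error message of this shape.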
[jira] [Assigned] (SPARK-32824) The error is confusing when resource .amount not provided
[ https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32824: Assignee: Thomas Graves > The error is confusing when resource .amount not provided > -- > > Key: SPARK-32824 > URL: https://issues.apache.org/jira/browse/SPARK-32824 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > If the user forgets to specify the .amount when specifying a resource, the > error that comes out is confusing, we should improve. > > $ $SPARK_HOME/bin/spark-shell --master spark://host9:7077 --conf > spark.executor.resource.gpu=1 > > {code:java} > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.propertiesSetting default log level to > "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error > initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String > index out of range: -1 at java.lang.String.substring(String.java:1967) at > org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150) > at > org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158) > at > 
org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773) > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884) > at org.apache.spark.SparkContext.(SparkContext.scala:528) at > org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at > $line3.$read$$iw$$iw.(:15) at > $line3.$read$$iw.(:42) at > $line3.$read.(:44){code} > ' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing
[ https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32638: - Fix Version/s: 3.0.2 > WidenSetOperationTypes in subquery attribute missing > -- > > Key: SPARK-32638 > URL: https://issues.apache.org/jira/browse/SPARK-32638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Guojian Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > I am migrating SQL from MySQL to Spark SQL and met a very strange case. Below > is code to reproduce the exception: > > {code:java} > val spark = SparkSession.builder() > .master("local") > .appName("Word Count") > .getOrCreate() > spark.sparkContext.setLogLevel("TRACE") > val DecimalType = DataTypes.createDecimalType(20, 2) > val schema = StructType(List( > StructField("a", DecimalType, true) > )) > val dataList = new util.ArrayList[Row]() > val df = spark.createDataFrame(dataList, schema) > df.printSchema() > df.createTempView("test") > val sql = > """ > |SELECT t.kpi_04 FROM > |( > | SELECT a as `kpi_04` FROM test > | UNION ALL > | SELECT a+a as `kpi_04` FROM test > |) t > | > """.stripMargin > spark.sql(sql) > {code} > > Exception Message: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved > attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. > Attribute(s) with the same name appear in the operation: kpi_04. 
Please check > if the right attribute(s) are used.;; > !Project [kpi_04#2] > +- SubqueryAlias t > +- Union > :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4] > : +- Project [a#0 AS kpi_04#2] > : +- SubqueryAlias test > : +- LocalRelation <empty>, [a#0] > +- Project [kpi_04#3] > +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2)))), DecimalType(21,2)) AS > kpi_04#3] > +- SubqueryAlias test > +- LocalRelation <empty>, [a#0]{code} > > > Based on the trace log, it seems the WidenSetOperationTypes rule adds a new outer Project > layer, which causes the parent query to lose its reference to the subquery. > > > {code:java} > > === Applying Rule > org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes === > !'Project [kpi_04#2] !Project [kpi_04#2] > !+- 'SubqueryAlias t +- SubqueryAlias t > ! +- 'Union +- Union > ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS > kpi_04#4] > ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2] > ! : +- LocalRelation <empty>, [a#0] : +- SubqueryAlias test > ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2)))), DecimalType(21,2)) AS > kpi_04#3] : +- LocalRelation <empty>, [a#0] > ! +- SubqueryAlias test +- Project [kpi_04#3] > ! +- LocalRelation <empty>, [a#0] +- Project > [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2)))), DecimalType(21,2)) AS > kpi_04#3] > ! +- SubqueryAlias test > ! +- LocalRelation <empty>, [a#0] > {code} > > In the source code (WidenSetOperationTypes.scala) this is intended behavior, > but it possibly misses this edge case. > I hope someone can help fix it. > > > {code:java} > if (targetTypes.nonEmpty) { > // Add an extra Project if the targetTypes are different from the original > types. > children.map(widenTypes(_, targetTypes)) > } else { > // Unable to find a target type to widen, then just return the original set. 
> children > }{code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
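For context on why the extra Project appears at all: the union's two sides have types decimal(20,2) and decimal(21,2) (the sum a+a needs one more integral digit), and set-operation widening picks a common type that keeps the larger scale and enough integral digits for both sides, here decimal(21,2). A small sketch of that widening rule (illustrative; the authoritative logic is in Spark's TypeCoercion/DecimalPrecision):

```python
def widen_decimal(p1, s1, p2, s2, max_precision=38):
    """Widen two decimal(p, s) types for a set operation: keep the larger
    scale and the larger integral-digit count, capped at max_precision."""
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)
    return min(integral + scale, max_precision), scale

# decimal(20,2) UNION ALL decimal(21,2) -> decimal(21,2), matching the
# cast(kpi_04#2 as decimal(21,2)) projection in the plan above.
widened = widen_decimal(20, 2, 21, 2)
```

Because the first child's type differs from the widened type, `widenTypes` wraps it in a new Project with a fresh attribute id (kpi_04#4), which is the id the outer `!Project [kpi_04#2]` no longer resolves against.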
[jira] [Updated] (SPARK-32812) Run tests script for Python fails in certain environments
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32812: -- Fix Version/s: (was: 2.4.8) 2.4.7 > Run tests script for Python fails in certain environments > - > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 2.4.7, 3.1.0, 3.0.2 > > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32810: -- Fix Version/s: (was: 2.4.8) 2.4.7 > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.7, 3.1.0, 3.0.2 > > > The problem is that when the user doesn't specify the schema when reading a > CSV table, the CSV file format and data source need to infer the schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelation tries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply ^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
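The glob-vs-verbatim mismatch described above is easy to reproduce with ordinary filesystem globbing: once a pattern like `[abc].csv` has been resolved to a concrete path, passing that path through another globbing layer re-interprets `[abc]` as a character class. An illustrative standalone Python sketch (not Spark's path handling):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # A file literally named '[abc].csv', plus a file its glob form would match.
    for name in ("[abc].csv", "a.csv"):
        open(os.path.join(d, name), "w").close()

    # Interpreted as a glob pattern, '[abc]' matches one char: 'a', 'b', or 'c'.
    as_glob = sorted(os.path.basename(p)
                     for p in glob.glob(os.path.join(d, "[abc].csv")))
    # Escaped, the brackets are taken verbatim and the literal file is found.
    as_literal = sorted(os.path.basename(p)
                        for p in glob.glob(os.path.join(glob.escape(d),
                                                        glob.escape("[abc].csv"))))
```

Spark's fix is analogous: paths produced by the first globbing step must be treated as verbatim paths, not re-fed to schema inference as patterns.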
[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32810: - Assignee: Maxim Gekk (was: Apache Spark) > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.8, 3.1.0, 3.0.2 > > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32312: Assignee: (was: Apache Spark) > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow will soon release v1.0.0 which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up minimum supported > version and CI testing. > TBD: list of important improvements and fixes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32312: Assignee: Apache Spark > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Major > > Apache Arrow will soon release v1.0.0 which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up minimum supported > version and CI testing. > TBD: list of important improvements and fixes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192422#comment-17192422 ] Apache Spark commented on SPARK-32312: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/29686 > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow will soon release v1.0.0 which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up minimum supported > version and CI testing. > TBD: list of important improvements and fixes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32824) The error is confusing when resource .amount not provided
[ https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192394#comment-17192394 ] Apache Spark commented on SPARK-32824: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/29685 > The error is confusing when resource .amount not provided > -- > > Key: SPARK-32824 > URL: https://issues.apache.org/jira/browse/SPARK-32824 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > If the user forgets to specify the .amount when specifying a resource, the > error that comes out is confusing, we should improve. > > $ $SPARK_HOME/bin/spark-shell --master spark://host9:7077 --conf > spark.executor.resource.gpu=1 > > {code:java} > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.propertiesSetting default log level to > "WARN".To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use > setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error > initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String > index out of range: -1 at java.lang.String.substring(String.java:1967) at > org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150) > at > org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158) > at > org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773) > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884) > at org.apache.spark.SparkContext.(SparkContext.scala:528) at > org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at > $line3.$read$$iw$$iw.(:15) at > $line3.$read$$iw.(:42) at > $line3.$read.(:44){code} > ' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32824) The error is confusing when resource .amount not provided
[ https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32824: Assignee: Apache Spark > The error is confusing when resource .amount not provided > -- > > Key: SPARK-32824 > URL: https://issues.apache.org/jira/browse/SPARK-32824 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Thomas Graves > Assignee: Apache Spark > Priority: Major > > If the user forgets to specify the .amount suffix when specifying a resource, the error that comes out is confusing; we should improve it. > > $ $SPARK_HOME/bin/spark-shell --master spark://host9:7077 --conf spark.executor.resource.gpu=1 > > {code:java}
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 20/09/08 08:19:35 ERROR SparkContext: Error initializing SparkContext.
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>   at java.lang.String.substring(String.java:1967)
>   at org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>   at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>   at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555)
>   at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
>   at $line3.$read$$iw$$iw.<init>(<console>:15)
>   at $line3.$read$$iw.<init>(<console>:42)
>   at $line3.$read.<init>(<console>:44)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
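The StringIndexOutOfBoundsException above comes from parsing the resource config key. A minimal standalone sketch (hypothetical code that mirrors, but does not copy, the parsing in ResourceUtils.listResourceIds) shows why a key without a `.amount` suffix fails, alongside a guarded alternative that produces a readable message:

```scala
// componentName mimics taking the resource name as everything before the
// first '.'; indexOf returns -1 when the key has no '.' at all, and
// substring(0, -1) throws "String index out of range: -1".
def componentName(key: String): String =
  key.substring(0, key.indexOf('.'))          // throws for a bare "gpu"

// A guarded version that turns the failure into an actionable error.
def safeComponentName(key: String): Either[String, String] = {
  val dot = key.indexOf('.')
  if (dot < 0) Left(s"resource key '$key' is missing a suffix such as .amount")
  else Right(key.substring(0, dot))
}

println(safeComponentName("gpu.amount"))  // Right(gpu)
println(safeComponentName("gpu"))         // Left(resource key 'gpu' is missing a suffix such as .amount)
```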
[jira] [Commented] (SPARK-32824) The error is confusing when resource .amount not provided
[ https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192393#comment-17192393 ] Apache Spark commented on SPARK-32824: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/29685
[jira] [Assigned] (SPARK-32824) The error is confusing when resource .amount not provided
[ https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32824: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-32825) CTE support on MSSQL
Ankit Sinha created SPARK-32825: --- Summary: CTE support on MSSQL Key: SPARK-32825 URL: https://issues.apache.org/jira/browse/SPARK-32825 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.6 Reporter: Ankit Sinha The issue, along with a stack trace, is described in detail [here|https://github.com/microsoft/mssql-jdbc/issues/1340]. In summary, a WITH CTE clause does not work against MSSQL. This may be because WITH CTE is not supported inside the FROM clause in MSSQL.
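One workaround, until this is addressed, is to avoid WITH altogether and hand the JDBC source an equivalent derived table, which MSSQL does accept inside a FROM clause. A hedged sketch: `inlineSingleCte` is a hypothetical helper, not part of Spark, and it only handles the trivial single-CTE shape `WITH name AS (...) SELECT * FROM name`:

```scala
// Rewrite a query with one trivial CTE into a derived-table form that
// can survive being embedded in another FROM clause. Anything that does
// not match the simple shape is returned unchanged.
def inlineSingleCte(sql: String): String = {
  val Pattern = """(?s)WITH\s+(\w+)\s+AS\s+\((.*)\)\s+SELECT\s+\*\s+FROM\s+\1""".r
  sql match {
    case Pattern(name, body) => s"(${body.trim}) AS $name"
    case _                   => sql
  }
}

val cte = "WITH recent AS (SELECT id FROM dbo.orders WHERE year = 2020) SELECT * FROM recent"
println(inlineSingleCte(cte))
// (SELECT id FROM dbo.orders WHERE year = 2020) AS recent
```

The rewritten string could then be passed to a reader as a table expression instead of the original CTE query; for anything beyond this toy shape, rewriting the SQL by hand is safer than pattern matching.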
[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192343#comment-17192343 ] Apache Spark commented on SPARK-32810: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29684 > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.1.0 > Reporter: Maxim Gekk > Assignee: Apache Spark > Priority: Major > Fix For: 2.4.8, 3.1.0, 3.0.2 > > > The problem is that when the user doesn't specify the schema when reading a CSV table, the CSV file format and data source need to infer the schema, and they do so by creating a base DataSource relation. There is a mismatch: *FileFormat.inferSchema* expects actual file paths without glob patterns, but *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java}
>                               ^
> | DataSource.resolveRelation  tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply            ^
> | CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema   |
> | ...                         |
> | DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as a verbatim path, not as a glob pattern
> | DataSource.apply            ^
> | DataFrameReader.load        |
> | input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
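The hazard of globbing an already-resolved path can be reproduced without Spark. This sketch uses java.nio's glob matcher purely for illustration (an assumption; Spark uses Hadoop's globber, but the bracket semantics are the same): a file literally named `[abc].csv` is itself a valid glob pattern, so re-globbing it matches the wrong files.

```scala
import java.nio.file.{FileSystems, Paths}

// "[abc]" is a character class in glob syntax: it matches a single
// 'a', 'b', or 'c'. Feeding the resolved name back into a matcher
// therefore hits "a.csv" rather than the literal "[abc].csv" file.
val matcher = FileSystems.getDefault.getPathMatcher("glob:[abc].csv")

println(matcher.matches(Paths.get("a.csv")))      // true  - matched as a pattern
println(matcher.matches(Paths.get("[abc].csv")))  // false - the literal file no longer matches itself
```

This is why the fix routes verbatim paths around the second globbing step rather than escaping them.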
[jira] [Updated] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurangi Saxena updated SPARK-32135: Description: Our service dynamically creates short-lived YARN clusters in the cloud. Spark applications run on these dynamically created clusters, and log data for these applications is stored on a remote file system. We configure a static instance of the Spark History Server to view information on jobs that ran on these clusters. Since we are using a single History Server for multiple clusters, it would be useful for users to see where each job was executed. We would like to display this information on the main web page instead of having users go through multiple links to retrieve it. was: We would like to see the Spark driver host on the History Server web page > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core > Affects Versions: 2.4.5 > Reporter: Gaurangi Saxena > Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png
[jira] [Updated] (SPARK-32097) Allow reading history log files from multiple directories
[ https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurangi Saxena updated SPARK-32097: Description: Our service dynamically creates short-lived YARN clusters in the cloud. Spark applications run on these dynamically created clusters, and log data for these applications is stored on a remote file system. We want a static instance of the Spark History Server to view information on jobs that ran on these clusters. We use a glob because we cannot have a static list of directories where the log files reside. (was: We would like to configure the Spark History Server to display applications from multiple clusters/environments. Data displayed on this UI comes from the directory configured as the log directory. It would be nice if this log directory also accepted a regex; that way we would be able to read and display applications from multiple directories.) > Allow reading history log files from multiple directories > - > > Key: SPARK-32097 > URL: https://issues.apache.org/jira/browse/SPARK-32097 > Project: Spark > Issue Type: Wish > Components: Spark Core > Affects Versions: 2.4.5 > Reporter: Gaurangi Saxena > Priority: Minor
[jira] [Assigned] (SPARK-32823) Standalone Master UI resources in use wrong
[ https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32823: Assignee: Apache Spark > Standalone Master UI resources in use wrong > --- > > Key: SPARK-32823 > URL: https://issues.apache.org/jira/browse/SPARK-32823 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 3.0.0 > Reporter: Thomas Graves > Assignee: Apache Spark > Priority: Major > > I was using the standalone deployment with GPU workers, and the Master UI was wrong for: > * *Resources in use:* 0 / 4 gpu > In this case I had 2 workers, each with 4 GPUs, so this total should have been 8. It seems like it's just looking at a single worker.
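A minimal sketch of the aggregation the UI needs, using a hypothetical `WorkerInfo` record rather than Spark's actual Master code: resource totals must be summed across all workers, so two workers with 4 GPUs each should render as "0 / 8 gpu", not "0 / 4 gpu".

```scala
// Hypothetical per-worker resource counts, keyed by resource name.
case class WorkerInfo(resourcesUsed: Map[String, Int], resourcesTotal: Map[String, Int])

// Merge per-worker maps by summing counts per resource name.
def sumCounts(ms: Seq[Map[String, Int]]): Map[String, Int] =
  ms.foldLeft(Map.empty[String, Int]) { (acc, m) =>
    m.foldLeft(acc) { case (a, (k, v)) => a.updated(k, a.getOrElse(k, 0) + v) }
  }

val workers = Seq(
  WorkerInfo(Map("gpu" -> 0), Map("gpu" -> 4)),
  WorkerInfo(Map("gpu" -> 0), Map("gpu" -> 4)))

val used  = sumCounts(workers.map(_.resourcesUsed))
val total = sumCounts(workers.map(_.resourcesTotal))
println(s"Resources in use: ${used("gpu")} / ${total("gpu")} gpu")
// Resources in use: 0 / 8 gpu
```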
[jira] [Commented] (SPARK-32823) Standalone Master UI resources in use wrong
[ https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192295#comment-17192295 ] Apache Spark commented on SPARK-32823: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/29683
[jira] [Assigned] (SPARK-32823) Standalone Master UI resources in use wrong
[ https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32823: Assignee: (was: Apache Spark)
[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280 ] Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:45 PM: -- for those looking for a temporary workaround, run this code before you start the streaming:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/** regex to find the last compact file */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/** implicit Hadoop RemoteIterator converter */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = {
  case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = underlying.hasNext
    override def next: T = underlying.next
  }
  wrapper(underlying)
}

/** delete a file or folder recursively */
def removePath(dstPath: String, fs: FileSystem): Unit = {
  val path = new Path(dstPath)
  if (fs.exists(path)) {
    println(s"deleting ${dstPath}...")
    fs.delete(path, true)
  }
}

/**
 * remove json entries older than `days` from the compact file,
 * preserve `v1` at the head of the file, and
 * rewrite the smaller file back to the original destination
 */
def compact(file: Path, days: Int = 20) = {
  val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days)
  val compacted_file = s"/tmp/${file.getName.toString}"
  removePath(compacted_file, fs)
  val lines = sc.textFile(file.toString)
  val reduced_lines = lines.mapPartitions({ p =>
    implicit val formats = DefaultFormats
    p.collect({
      case "v1" => "v1"
      case x if { parse(x).extract[SinkFileStatus].modificationTime > ttl } => x
    })
  }).coalesce(1)
  println(s"removing ${lines.count - reduced_lines.count} lines from ${file.toString}...")
  reduced_lines.saveAsTextFile(compacted_file)
  FileUtil.copy(fs, new Path(compacted_file + "/part-0"), fs, file, false, sc.hadoopConfiguration)
  removePath(compacted_file, fs)
}

/** get the last compacted file, if one exists */
def getLastCompactFile(path: Path) = {
  fs.listFiles(path, true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
    case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) => x.getPath
  })
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka")
  // ... whatever stream you like

df.writeStream
  .trigger(Trigger.ProcessingTime(30))
  .format("parquet")
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/my/checkpoint/path")
  .option("path", my_folder)
  .start()
{code}
this example will retain SinkFileStatus entries from the last 20 days and purge everything else. I run this code on driver startup, but it can certainly run async in some sidecar cron job. Tested on Spark 3.0.0 writing parquet files.
[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280 ] Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:45 PM: -- for those looking for a temporary workaround: run this code before you start the streaming: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs._ import org.apache.spark.sql._ import org.json4s._ import org.json4s.jackson.JsonMethods._ import org.apache.spark.sql.execution.streaming._ /** * regex to find last compact file */ val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored val fs = FileSystem.get(sc.hadoopConfiguration) /** * implicit hadoop RemoteIterator convertor */ implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = { case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] { override def hasNext: Boolean = underlying.hasNext override def next: T = underlying.next } wrapper(underlying) } /** * delete file or folder recursively */ def removePath(dstPath: String, fs: FileSystem): Unit = { val path = new Path(dstPath) if (fs.exists(path)) { println(s"deleting ${dstPath}...") fs.delete(path, true) } } /** * remove json entries older than `days` from compact file * preserve `v1` at the head of the file * re write the small file back to the original destination */ def compact(file: Path, days: Int = 20) = { val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days) val compacted_file = s"/tmp/${file.getName.toString}" removePath(compacted_file, fs) val lines = sc.textFile(file.toString) val reduced_lines = lines.mapPartitions({ p => implicit val formats = DefaultFormats p.collect({ case "v1" => "v1" case x if { parse(x).extract[SinkFileStatus].modificationTime > ttl } => x }) }).coalesce(1) println(s"removing ${lines.count - reduced_lines.count} lines from ${file.toString}...") 
reduced_lines.saveAsTextFile(compacted_file) FileUtil.copy(fs, new Path(compacted_file + "/part-0"), fs, file, false, sc.hadoopConfiguration) removePath(compacted_file, fs) } /** * get last compacted files if exists */ def getLastCompactFile(path: Path) = { fs.listFiles(path, true).toList.sortBy(_.getModificationTime).reverse.collectFirst({ case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) => x.getPath }) } val my_folder = "/my/root/spark/structerd/streaming/output/folder" val metadata_folder = new Path(s"$my_folder/_spark_metadata")) getLastCompactFile(metadata_folder).map(x => compact(x, 20)) val df = spark .readStream .format("kafka") ///. whatever stream you like df.writeStream .trigger(Trigger.ProcessingTime(30)) .format("parquet") .outputMode(OutputMode.Append()) .option("checkpointLocation", "/my/checkpoint/path") .option("path", my_folder ) .start() {code} this example will retain SinkFileStatus from the last 20 days and will purge everything else I run this code on driver startup - but it can certainly run async in some sidecar cronjob tested on spark 3.0.0 writing parquet files was (Author: sta...@gmail.com): for those looking for a temporary workaround: run this code before you start the streaming: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs._ import org.apache.spark.sql._ import org.json4s._ import org.json4s.jackson.JsonMethods._ import org.apache.spark.sql.execution.streaming._ /** * regex to find last compact file */ val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored val fs = FileSystem.get(sc.hadoopConfiguration) /** * implicit hadoop RemoteIterator convertor */ implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = { case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] { override def hasNext: Boolean = underlying.hasNext override def next: T = underlying.next } wrapper(underlying) } /** * delete file or folder 
recursively */ def removePath(dstPath: String, fs: FileSystem): Unit = { val path = new Path(dstPath) if (fs.exists(path)) { println(s"deleting ${dstPath}...") fs.delete(path, true) } } /** * remove json entries older than `days` from compact file * preserve `v1` at the head of the file * re write the small file back to the original destination */ def compact(file: Path, days: Int = 20) = { val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days) val compacted_file = s"/tmp/${fi
[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280 ] Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:44 PM: -- for those looking for a temporary workaround: run this code before you start the streaming: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs._ import org.apache.spark.sql._ import org.json4s._ import org.json4s.jackson.JsonMethods._ import org.apache.spark.sql.execution.streaming._ /** * regex to find last compact file */ val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored val fs = FileSystem.get(sc.hadoopConfiguration) /** * implicit hadoop RemoteIterator convertor */ implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = { case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] { override def hasNext: Boolean = underlying.hasNext override def next: T = underlying.next } wrapper(underlying) } /** * delete file or folder recursively */ def removePath(dstPath: String, fs: FileSystem): Unit = { val path = new Path(dstPath) if (fs.exists(path)) { println(s"deleting ${dstPath}...") fs.delete(path, true) } } /** * remove json entries older than `days` from compact file * preserve `v1` at the head of the file * re write the small file back to the original destination */ def compact(file: Path, days: Int = 20) = { val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days) val compacted_file = s"/tmp/${file.getName.toString}" removePath(compacted_file, fs) val lines = sc.textFile(file.toString) val reduced_lines = lines.mapPartitions({ p => implicit val formats = DefaultFormats p.collect({ case "v1" => "v1" case x if { parse(x).extract[SinkFileStatus].modificationTime > ttl } => x }) }).coalesce(1) println(s"removing ${lines.count - reduced_lines.count} lines from ${file.toString}...") 
reduced_lines.saveAsTextFile(compacted_file) FileUtil.copy(fs, new Path(compacted_file + "/part-0"), fs, file, false, sc.hadoopConfiguration) removePath(compacted_file, fs) } /** * get last compacted files if exists */ def getLastCompactFile(path: Path) = { fs.listFiles(path, true).toList.sortBy(_.getModificationTime).reverse.collectFirst({ case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) => x.getPath }) } val my_folder = "/my/root/spark/structerd/streaming/output/folder" val metadata_folder = new Path(s"$my_folder/_spark_metadata")) getLastCompactFile(metadata_folder).map(x => compact(x, 20)) val df = spark .readStream .format("kafka") ///. whatever stream you like df3 .writeStream .trigger(Trigger.ProcessingTime(30)) .format("parquet") .outputMode(OutputMode.Append()) .option("checkpointLocation", "/my/checkpoint/path") .option("path", my_folder ) .start() {code} this example will retain SinkFileStatus from the last 20 days and will purge everything else I run this code on driver startup - but it can certainly run async in some sidecar cronjob tested on spark 3.0.0 writing parquet files was (Author: sta...@gmail.com): for those looking for a temporary workaround: run this code before you start the streaming: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs._ import org.apache.spark.sql._ import org.json4s._ import org.json4s.jackson.JsonMethods._ import org.apache.spark.sql.execution.streaming._ /** * regex to find last compact file */ val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored val fs = FileSystem.get(sc.hadoopConfiguration) /** * implicit hadoop RemoteIterator convertor */ implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = { case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] { override def hasNext: Boolean = underlying.hasNext override def next: T = underlying.next } wrapper(underlying) } /** * delete file or folder 
recursively */ def removePath(dstPath: String, fs: FileSystem): Unit = { val path = new Path(dstPath) if (fs.exists(path)) { println(s"deleting ${dstPath}...") fs.delete(path, true) } } /** * remove json entries older than `days` from compact file * preserve `v1` at the head of the file * re write the small file back to the original destination */ def compact(file: Path, days: Int = 20) = { val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days) val compacted_file = s"/
[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280 ] Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:42 PM:
--

For those looking for a temporary workaround, run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/** regex to find the last compact file */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored

val fs = FileSystem.get(sc.hadoopConfiguration)

/** implicit Hadoop RemoteIterator converter */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = {
  case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = underlying.hasNext
    override def next: T = underlying.next
  }
  wrapper(underlying)
}

/** delete a file or folder recursively */
def removePath(dstPath: String, fs: FileSystem): Unit = {
  val path = new Path(dstPath)
  if (fs.exists(path)) {
    println(s"deleting ${dstPath}...")
    fs.delete(path, true)
  }
}

/**
 * Remove JSON entries older than `days` from the compact file,
 * preserve "v1" at the head of the file, and write the smaller
 * file back to the original destination.
 */
def compact(file: Path, days: Int = 20) = {
  val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days)
  val compacted_file = s"/tmp/${file.getName.toString}"
  removePath(compacted_file, fs)
  val lines = sc.textFile(file.toString)
  val reduced_lines = lines.mapPartitions({ p =>
    implicit val formats = DefaultFormats
    p.collect({
      case "v1" => "v1"
      case x if parse(x).extract[SinkFileStatus].modificationTime > ttl => x
    })
  }).coalesce(1)
  println(s"removing ${lines.count - reduced_lines.count} lines from ${file.toString}...")
  reduced_lines.saveAsTextFile(compacted_file)
  // saveAsTextFile on a single partition writes part-00000
  FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, false, sc.hadoopConfiguration)
  removePath(compacted_file, fs)
}

/** get the last compact file, if one exists */
def getLastCompactFile(path: Path) = {
  fs.listFiles(path, true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
    case x if file_pattern.findFirstMatchIn(x.getPath.toString).isDefined => x.getPath
  })
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka") // ...whatever stream you like
{code}

This example retains SinkFileStatus entries from the last 20 days and purges everything else. I run this code on driver startup, but it could also run asynchronously in a sidecar cron job. Tested on Spark 3.0.0 writing parquet files.
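The retention logic in the workaround above boils down to a filter over the compact file's lines: keep the "v1" version header plus every entry newer than a cutoff. A minimal, Spark-free sketch of that filter follows; `Entry` and the comma-separated line format are stand-ins invented here for illustration (the real entries are `SinkFileStatus` JSON).

```scala
// Spark-free sketch of the compact-file retention filter.
// Entry stands in for SinkFileStatus; only modificationTime matters here.
case class Entry(path: String, modificationTime: Long)

// Hypothetical parser for "path,millis" lines (real entries are JSON).
def parseEntry(s: String): Entry = {
  val Array(p, t) = s.split(",")
  Entry(p, t.toLong)
}

// Keep the "v1" version header plus entries newer than the cutoff.
def retain(lines: Seq[String], ttlMillis: Long): Seq[String] =
  lines.collect {
    case "v1" => "v1"
    case line if parseEntry(line).modificationTime > ttlMillis => line
  }

// The header survives, entries at or below the cutoff are dropped.
println(retain(Seq("v1", "old.parquet,50", "new.parquet,150"), 100L))
```

The real `compact` above applies the same partial function inside `mapPartitions`, so each partition parses and filters its own slice of the log.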
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280 ] Avner Livne commented on SPARK-24295:
-

For those looking for a temporary workaround, run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/** regex to find the last compact file */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored

val fs = FileSystem.get(sc.hadoopConfiguration)

/** implicit Hadoop RemoteIterator converter */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): Iterator[T] = {
  case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = underlying.hasNext
    override def next: T = underlying.next
  }
  wrapper(underlying)
}

/** delete a file or folder recursively */
def removePath(dstPath: String, fs: FileSystem): Unit = {
  val path = new Path(dstPath)
  if (fs.exists(path)) {
    println(s"deleting ${dstPath}...")
    fs.delete(path, true)
  }
}

/**
 * Remove JSON entries older than `days` from the compact file,
 * preserve "v1" at the head of the file, and write the smaller
 * file back to the original destination.
 */
def compact(file: Path, days: Int = 20) = {
  val ttl = new java.util.Date().getTime - java.util.concurrent.TimeUnit.DAYS.toMillis(days)
  val compacted_file = s"/tmp/${file.getName.toString}"
  removePath(compacted_file, fs)
  val lines = sc.textFile(file.toString)
  val reduced_lines = lines.mapPartitions({ p =>
    implicit val formats = DefaultFormats
    p.collect({
      case "v1" => "v1"
      case x if parse(x).extract[SinkFileStatus].modificationTime > ttl => x
    })
  }).coalesce(1)
  println(s"removing ${lines.count - reduced_lines.count} lines from ${file.toString}...")
  reduced_lines.saveAsTextFile(compacted_file)
  // saveAsTextFile on a single partition writes part-00000
  FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, false, sc.hadoopConfiguration)
  removePath(compacted_file, fs)
}

/** get the last compact file, if one exists */
def getLastCompactFile(path: Path) = {
  fs.listFiles(path, true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
    case x if file_pattern.findFirstMatchIn(x.getPath.toString).isDefined => x.getPath
  })
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df1 = spark
  .readStream
  .format("kafka") // ...whatever stream you like
{code}

This example retains SinkFileStatus entries from the last 20 days and purges everything else. I run this code on driver startup, but it could also run asynchronously in a sidecar cron job. Tested on Spark 3.0.0 writing parquet files.

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Iqbal Singh
> Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing slowness while reading data from the FileStreamSinkLog dir, as Spark defaults to the "__spark__metadata" dir for the read.
> We need functionality to purge the compact file data.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32824) The error is confusing when resource .amount not provided
Thomas Graves created SPARK-32824: - Summary: The error is confusing when resource .amount not provided Key: SPARK-32824 URL: https://issues.apache.org/jira/browse/SPARK-32824 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves

If the user forgets to specify the .amount suffix when specifying a resource, the error that comes out is confusing; we should improve it.

$ $SPARK_HOME/bin/spark-shell --master spark://host9:7077 --conf spark.executor.resource.gpu=1

{code:java}
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/09/08 08:19:35 ERROR SparkContext: Error initializing SparkContext.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.substring(String.java:1967)
  at org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
  at org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
  at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
  at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
  at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555)
  at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
  at $line3.$read$$iw$$iw.<init>(<console>:15)
  at $line3.$read$$iw.<init>(<console>:42)
  at $line3.$read.<init>(<console>:44)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
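The trace above is consistent with resource-name parsing that assumes a dot-separated suffix. The following is a speculative re-creation of the failure shape, not Spark's actual code: with `spark.executor.resource.gpu.amount=1` the per-resource suffix is `gpu.amount`, and taking everything before the first `.` yields `gpu`; with the `.amount` part missing the suffix is just `gpu`, so `indexOf('.')` returns -1 and `substring(0, -1)` throws the unhelpful StringIndexOutOfBoundsException instead of a clear message.

```scala
// Speculative sketch of the failure shape (not Spark's exact code):
// extract the resource name as everything before the first '.'.
def resourceName(suffix: String): String =
  suffix.substring(0, suffix.indexOf('.'))

// Well-formed key suffix: "gpu.amount" -> "gpu".
println(resourceName("gpu.amount"))

// With ".amount" missing, indexOf('.') is -1 and substring(0, -1) throws.
try {
  resourceName("gpu")
} catch {
  case _: StringIndexOutOfBoundsException =>
    println("low-level index error surfaces instead of a clear config message")
}
```

A clearer behavior would be to validate the suffix first and report the offending configuration key by name.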
[jira] [Created] (SPARK-32823) Standalone Master UI resources in use wrong
Thomas Graves created SPARK-32823: - Summary: Standalone Master UI resources in use wrong Key: SPARK-32823 URL: https://issues.apache.org/jira/browse/SPARK-32823 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: Thomas Graves I was using the standalone deployment with workers with GPUs and the master ui was wrong for: * *Resources in use:* 0 / 4 gpu In this case I had 2 workers, each with 4 gpus, so this total should have been 8. It seems like it's just looking at a single worker. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32823) Standalone Master UI resources in use wrong
[ https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192256#comment-17192256 ] Thomas Graves commented on SPARK-32823: --- I'm looking into this. > Standalone Master UI resources in use wrong > --- > > Key: SPARK-32823 > URL: https://issues.apache.org/jira/browse/SPARK-32823 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > I was using the standalone deployment with workers with GPUs and the master > ui was wrong for: > * *Resources in use:* 0 / 4 gpu > In this case I had 2 workers, each with 4 gpus, so this total should have > been 8. It seems like it's just looking at a single worker. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192252#comment-17192252 ] Apache Spark commented on SPARK-32753: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29682 > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > Labels: correctness > Fix For: 3.1.0, 3.0.2 > > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192251#comment-17192251 ] Apache Spark commented on SPARK-32753: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29682 > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > Labels: correctness > Fix For: 3.1.0, 3.0.2 > > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters
[ https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32815: Fix Version/s: 3.0.2 2.4.8 > Fix LibSVM data source loading error on file paths with glob metacharacters > --- > > Key: SPARK-32815 > URL: https://issues.apache.org/jira/browse/SPARK-32815 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.8, 3.1.0, 3.0.2 > > > SPARK-32810 fixed a long standing bug in a few Spark built-in data sources > that fails to read files whose names contain glob metacharacters, such as [, > ], \{, }, etc. > CSV and JSON data source on the Spark side were affected. We've also noticed > that the LibSVM data source had the same code pattern that leads to the bug, > so the fix https://github.com/apache/spark/pull/29659 included a fix for that > data source as well, but it did not include a test for the LibSVM data source. > This ticket tracks adding a test case for LibSVM, similar to the ones for > CSV/JSON, to verify whether or not the fix works as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters
[ https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32815. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29670 [https://github.com/apache/spark/pull/29670] > Fix LibSVM data source loading error on file paths with glob metacharacters > --- > > Key: SPARK-32815 > URL: https://issues.apache.org/jira/browse/SPARK-32815 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > SPARK-32810 fixed a long standing bug in a few Spark built-in data sources > that fails to read files whose names contain glob metacharacters, such as [, > ], \{, }, etc. > CSV and JSON data source on the Spark side were affected. We've also noticed > that the LibSVM data source had the same code pattern that leads to the bug, > so the fix https://github.com/apache/spark/pull/29659 included a fix for that > data source as well, but it did not include a test for the LibSVM data source. > This ticket tracks adding a test case for LibSVM, similar to the ones for > CSV/JSON, to verify whether or not the fix works as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters
[ https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32815: --- Assignee: Maxim Gekk > Fix LibSVM data source loading error on file paths with glob metacharacters > --- > > Key: SPARK-32815 > URL: https://issues.apache.org/jira/browse/SPARK-32815 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > SPARK-32810 fixed a long standing bug in a few Spark built-in data sources > that fails to read files whose names contain glob metacharacters, such as [, > ], \{, }, etc. > CSV and JSON data source on the Spark side were affected. We've also noticed > that the LibSVM data source had the same code pattern that leads to the bug, > so the fix https://github.com/apache/spark/pull/29659 included a fix for that > data source as well, but it did not include a test for the LibSVM data source. > This ticket tracks adding a test case for LibSVM, similar to the ones for > CSV/JSON, to verify whether or not the fix works as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32753: Fix Version/s: 3.0.2 > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > Labels: correctness > Fix For: 3.1.0, 3.0.2 > > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32817) DPP throws error when broadcast side is empty
[ https://issues.apache.org/jira/browse/SPARK-32817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32817: - Affects Version/s: (was: 3.0.0) 3.1.0 > DPP throws error when broadcast side is empty > - > > Key: SPARK-32817 > URL: https://issues.apache.org/jira/browse/SPARK-32817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > Fix For: 3.1.0 > > > In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an > `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw > `UnsupportedOperationException`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32817) DPP throws error when broadcast side is empty
[ https://issues.apache.org/jira/browse/SPARK-32817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32817. -- Fix Version/s: 3.1.0 Assignee: Zhenhua Wang Resolution: Fixed Resolved by https://github.com/apache/spark/pull/29671 > DPP throws error when broadcast side is empty > - > > Key: SPARK-32817 > URL: https://issues.apache.org/jira/browse/SPARK-32817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > Fix For: 3.1.0 > > > In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an > `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw > `UnsupportedOperationException`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back
[ https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192150#comment-17192150 ] Apache Spark commented on SPARK-32822: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29681 > Change the number of partitions to zero when a range is empty with > WholeStageCodegen disabled or falled back > > > Key: SPARK-32822 > URL: https://issues.apache.org/jira/browse/SPARK-32822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > If WholeStageCodegen takes effect, the number of partitions of an empty range > will be changed to zero, but it is not changed when WholeStageCodegen is > disabled or falls back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
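The invariant this issue asks for can be sketched without Spark: an empty range should produce zero output partitions regardless of whether whole-stage codegen runs. The helper below is a hypothetical illustration of that rule, not Spark's actual Range planning code.

```scala
// Hypothetical sketch of the desired invariant for SPARK-32822:
// an empty range yields zero partitions, an non-empty one keeps the
// requested number of slices, independent of the codegen path taken.
def numPartitions(start: Long, end: Long, slices: Int): Int =
  if (start == end) 0 else slices

println(numPartitions(0L, 0L, 8))  // empty range: no partitions needed
println(numPartitions(0L, 10L, 8)) // non-empty range: requested slices
```

The reported bug is that only the codegen path applied this rule, so the interpreted (disabled or fallen-back) path produced empty partitions instead of none.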
[jira] [Assigned] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back
[ https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32822: Assignee: Apache Spark (was: Kousuke Saruta) > Change the number of partitions to zero when a range is empty with > WholeStageCodegen disabled or falled back > > > Key: SPARK-32822 > URL: https://issues.apache.org/jira/browse/SPARK-32822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > If WholeStageCodegen takes effect, the number of partitions of an empty range > will be changed to zero, but it is not changed when WholeStageCodegen is > disabled or falls back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back
[ https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32822: Assignee: Kousuke Saruta (was: Apache Spark) > Change the number of partitions to zero when a range is empty with > WholeStageCodegen disabled or falled back > > > Key: SPARK-32822 > URL: https://issues.apache.org/jira/browse/SPARK-32822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > If WholeStageCodegen takes effect, the number of partitions of an empty range > will be changed to zero, but it is not changed when WholeStageCodegen is > disabled or falls back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32748. -- Resolution: Won't Fix > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > > Since SPARK-22590, local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro reopened SPARK-32748: -- > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > > Since SPARK-22590, local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32748: - Fix Version/s: (was: 3.1.0) > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > > Since SPARK-22590, local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back
[ https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32822: --- Summary: Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back (was: Change the number of partition to zero when a range is empty with WholeStageCodegen disabled or falled back) > Change the number of partitions to zero when a range is empty with > WholeStageCodegen disabled or falled back > > > Key: SPARK-32822 > URL: https://issues.apache.org/jira/browse/SPARK-32822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > If WholeStageCodegen effects, the number of partition of an empty range will > be changed to zero. But it doesn't changed when WholeStageCodegen is disabled > or falled back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back
[ https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32822: --- Description: If WholeStageCodegen effects, the number of partitions of an empty range will be changed to zero. But it doesn't changed when WholeStageCodegen is disabled or falled back. (was: If WholeStageCodegen effects, the number of partition of an empty range will be changed to zero. But it doesn't changed when WholeStageCodegen is disabled or falled back.) > Change the number of partitions to zero when a range is empty with > WholeStageCodegen disabled or falled back > > > Key: SPARK-32822 > URL: https://issues.apache.org/jira/browse/SPARK-32822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > If WholeStageCodegen effects, the number of partitions of an empty range will > be changed to zero. But it doesn't changed when WholeStageCodegen is disabled > or falled back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32822) Change the number of partition to zero when a range is empty with WholeStageCodegen disabled or falled back
Kousuke Saruta created SPARK-32822: -- Summary: Change the number of partition to zero when a range is empty with WholeStageCodegen disabled or falled back Key: SPARK-32822 URL: https://issues.apache.org/jira/browse/SPARK-32822 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta If WholeStageCodegen takes effect, the number of partitions of an empty range is changed to zero, but it is not changed when WholeStageCodegen is disabled or has fallen back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192138#comment-17192138 ] Jungtaek Lim commented on SPARK-32821: -- Let's leave the fix version field empty; it will be filled in when the issue is resolved with a specific patch. Binding all Dataset APIs to SQL statements is challenging work, especially for Structured Streaming, given that streaming SQL doesn't have a standard. > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > > current only support dsl style as below: > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
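What `window($"timestamp", "10 minutes", "5 minutes")` computes — the set of sliding windows each event belongs to, which any SQL syntax for this issue would also need to express — can be sketched in pure Python over timestamps in plain seconds (illustrative only, not Spark's implementation):

```python
def windows_for(ts, window, slide):
    """Return all [start, end) sliding windows containing timestamp ts.
    window and slide are in the same unit as ts (e.g. seconds); with
    window == slide this degenerates to tumbling windows."""
    result = []
    start = ts - (ts % slide)  # latest window start at or before ts
    while start > ts - window:
        result.append((start, start + window))
        start -= slide
    return sorted(result)
```

For example, an event at t=420s (7 minutes) with a 10-minute window sliding every 5 minutes falls into the windows [0, 600) and [300, 900); a streaming group-by counts the event once per containing window.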
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32821: - Fix Version/s: (was: 3.0.1) > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > Labels: 2.1.0 > > current only support dsl style as below: > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32821: - Labels: (was: 2.1.0) > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > > current only support dsl style as below: > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Description: current only support dsl style as below: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" was: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > Labels: 2.1.0 > Fix For: 3.0.1 > > > current only support dsl style as below: > import spark.implicits._ > val words = ... 
// streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Affects Version/s: 2.2.0 2.3.0 2.4.0 > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Johnny Bai >Priority: Major > Labels: 2.1.0 > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Labels: 2.1.0 (was: ) > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Johnny Bai >Priority: Major > Labels: 2.1.0 > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Affects Version/s: (was: 3.0.0) 2.1.0 > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Johnny Bai >Priority: Major > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Target Version/s: (was: 2.4.3) > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Johnny Bai >Priority: Major > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... // streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32811) Replace IN predicate of continuous range with boundary checks
[ https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vu Ho updated SPARK-32811: -- Description: This expression {code:java} select a from t where a in (1, 2, 3, 3, 4){code} can be translated to {code:java} select a from t where a >= 1 and a <= 4 {code} This would speed up the Parquet row group filter (currently or(or(or(or(or(eq(x, 1), eq(x, 2)), eq(x, 3), eq(x, 4. and make the query more compact was: This expression {code:java} select a from t where a in (1, 2, 3, 3, 4){code} should be translated to {code:java} select a from t where a >= 1 and a <= 4 {code} > Replace IN predicate of continuous range with boundary checks > - > > Key: SPARK-32811 > URL: https://issues.apache.org/jira/browse/SPARK-32811 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Vu Ho >Priority: Minor > > This expression > {code:java} > select a from t where a in (1, 2, 3, 3, 4){code} > can be translated to > {code:java} > select a from t where a >= 1 and a <= 4 {code} > This would speed up the Parquet row group filter (currently or(or(or(or(or(eq(x, > 1), eq(x, 2)), eq(x, 3), eq(x, 4. and make the query more compact > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
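The rewrite proposed in SPARK-32811 is only valid when the deduplicated IN-list forms a contiguous range. A minimal sketch of that check (names are illustrative, not Catalyst's optimizer API):

```python
def rewrite_in_to_range(values):
    """If the deduplicated integer IN-list is contiguous, return (lo, hi)
    bounds so `a IN (...)` can become `a >= lo AND a <= hi`; else None."""
    s = sorted(set(values))
    # A sorted, deduplicated list is contiguous iff max - min == len - 1.
    if s and s[-1] - s[0] == len(s) - 1:
        return (s[0], s[-1])
    return None
```

So `(1, 2, 3, 3, 4)` collapses to the bounds `(1, 4)`, while a list with a gap such as `(1, 3, 4)` must keep the IN form, since the range check would wrongly admit the missing value.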
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Description: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" was: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Johnny Bai >Priority: Major > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... 
// streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Description: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" was: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Johnny Bai >Priority: Major > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... 
// streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 > minute') with watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Description: import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() but not support group by with window in sql style as below: "select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field" was: import spark.implicits._ val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() {{but not support group by with window in sql style as below:}} {{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"}} > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Johnny Bai >Priority: Major > Fix For: 3.0.1 > > > > import spark.implicits._ > val words = ... 
// streaming DataFrame of schema { timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > but not support group by with window in sql style as below: > "select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with > watermark 1 minute from tableX group by ts_field" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
[ https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Bai updated SPARK-32821: --- Description: import spark.implicits._ val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count() {{but not support group by with window in sql style as below:}} {{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"}} was: {{import spark.implicits._}} {{val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, word: String }}} {{// Group the data by window and word and compute the count of each group}} {{val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count()}} {{}} {{but not support group by with window in sql style as below:}} {{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"}} {{}} > cannot group by with window in sql sentence for structured streaming with > watermark > --- > > Key: SPARK-32821 > URL: https://issues.apache.org/jira/browse/SPARK-32821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Johnny Bai >Priority: Major > Fix For: 3.0.1 > > > > > import spark.implicits._ > val words = ... 
// streaming DataFrame of schema \{ timestamp: Timestamp, > word: String } > // Group the data by window and word and compute the count of each > group > val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 > minutes"),$"word").count() > > {{but not support group by with window in sql style as below:}} > {{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') > with watermark 1 minute from tableX group by ts_field"}} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing
[ https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192111#comment-17192111 ] Apache Spark commented on SPARK-32638: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29680 > WidenSetOperationTypes in subquery attribute missing > -- > > Key: SPARK-32638 > URL: https://issues.apache.org/jira/browse/SPARK-32638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Guojian Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.1.0 > > > I am migrating SQL from MySQL to Spark SQL and hit a very strange case. Below > is code to reproduce the exception: > > {code:java} > val spark = SparkSession.builder() > .master("local") > .appName("Word Count") > .getOrCreate() > spark.sparkContext.setLogLevel("TRACE") > val DecimalType = DataTypes.createDecimalType(20, 2) > val schema = StructType(List( > StructField("a", DecimalType, true) > )) > val dataList = new util.ArrayList[Row]() > val df=spark.createDataFrame(dataList,schema) > df.printSchema() > df.createTempView("test") > val sql= > """ > |SELECT t.kpi_04 FROM > |( > | SELECT a as `kpi_04` FROM test > | UNION ALL > | SELECT a+a as `kpi_04` FROM test > |) t > | > """.stripMargin > spark.sql(sql) > {code} > > Exception Message: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved > attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. > Attribute(s) with the same name appear in the operation: kpi_04. 
Please check > if the right attribute(s) are used.;; > !Project [kpi_04#2] > +- SubqueryAlias t > +- Union > :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4] > : +- Project [a#0 AS kpi_04#2] > : +- SubqueryAlias test > : +- LocalRelation , [a#0] > +- Project [kpi_04#3] > +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS > kpi_04#3] > +- SubqueryAlias test > +- LocalRelation , [a#0]{code} > > > Based on the trace log, it seems the WidenSetOperationTypes rule adds a new outer Project > layer, which causes the parent query to lose its reference to the subquery. > > > {code:java} > > === Applying Rule > org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes === > !'Project [kpi_04#2] !Project [kpi_04#2] > !+- 'SubqueryAlias t +- SubqueryAlias t > ! +- 'Union +- Union > ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS > kpi_04#4] > ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2] > ! : +- LocalRelation , [a#0] : +- SubqueryAlias test > ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS > kpi_04#3] : +- LocalRelation , [a#0] > ! +- SubqueryAlias test +- Project [kpi_04#3] > ! +- LocalRelation , [a#0] +- Project > [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS > kpi_04#3] > ! +- SubqueryAlias test > ! +- LocalRelation , [a#0] > {code} > > In the source code (WidenSetOperationTypes.scala) this is intended behavior, > but it possibly misses this edge case. > I hope someone can help fix it. > > > {code:java} > if (targetTypes.nonEmpty) { > // Add an extra Project if the targetTypes are different from the original > types. > children.map(widenTypes(_, targetTypes)) > } else { > // Unable to find a target type to widen, then just return the original set. 
> children > }{code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
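The decimal types in the plan above follow Spark's widening rule for set operations: the widened type keeps the larger scale and the larger integral part of the two branches. Here `a + a` on decimal(20,2) yields decimal(21,2), so the union widens the other branch from decimal(20,2) to decimal(21,2), inserting the extra Project the report describes. A sketch of the rule (ignoring Spark's precision cap of 38):

```python
def widen_decimal(p1, s1, p2, s2):
    """Widened (precision, scale) for a union of decimal(p1,s1) and
    decimal(p2,s2): keep the larger scale and enough integral digits
    for both sides. Sketch only; Spark also caps precision at 38."""
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)
    return (integral + scale, scale)
```

With the values from this issue, widening decimal(20,2) against decimal(21,2) gives decimal(21,2), matching the `cast(kpi_04#2 as decimal(21,2))` seen in the trace.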
[jira] [Commented] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing
[ https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192110#comment-17192110 ] Apache Spark commented on SPARK-32638: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29680 > WidenSetOperationTypes in subquery attribute missing > -- > > Key: SPARK-32638 > URL: https://issues.apache.org/jira/browse/SPARK-32638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Guojian Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.1.0 > > > I am migrating sql from mysql to spark sql, meet a very strange case. Below > is code to reproduce the exception: > > {code:java} > val spark = SparkSession.builder() > .master("local") > .appName("Word Count") > .getOrCreate() > spark.sparkContext.setLogLevel("TRACE") > val DecimalType = DataTypes.createDecimalType(20, 2) > val schema = StructType(List( > StructField("a", DecimalType, true) > )) > val dataList = new util.ArrayList[Row]() > val df=spark.createDataFrame(dataList,schema) > df.printSchema() > df.createTempView("test") > val sql= > """ > |SELECT t.kpi_04 FROM > |( > | SELECT a as `kpi_04` FROM test > | UNION ALL > | SELECT a+a as `kpi_04` FROM test > |) t > | > """.stripMargin > spark.sql(sql) > {code} > > Exception Message: > > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved > attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. > Attribute(s) with the same name appear in the operation: kpi_04. 
Please check > if the right attribute(s) are used.;; > !Project [kpi_04#2] > +- SubqueryAlias t > +- Union > :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4] > : +- Project [a#0 AS kpi_04#2] > : +- SubqueryAlias test > : +- LocalRelation , [a#0] > +- Project [kpi_04#3] > +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS > kpi_04#3] > +- SubqueryAlias test > +- LocalRelation , [a#0]{code} > > > Based on the trace log, it seems the WidenSetOperationTypes rule adds a new outer Project > layer, which causes the parent query to lose its reference to the subquery. > > > {code:java} > > === Applying Rule > org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes === > !'Project [kpi_04#2] !Project [kpi_04#2] > !+- 'SubqueryAlias t +- SubqueryAlias t > ! +- 'Union +- Union > ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS > kpi_04#4] > ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2] > ! : +- LocalRelation , [a#0] : +- SubqueryAlias test > ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS > kpi_04#3] : +- LocalRelation , [a#0] > ! +- SubqueryAlias test +- Project [kpi_04#3] > ! +- LocalRelation , [a#0] +- Project > [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + > promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS > kpi_04#3] > ! +- SubqueryAlias test > ! +- LocalRelation , [a#0] > {code} > > In the source code (WidenSetOperationTypes.scala), this is intended behavior, > but it possibly misses this edge case. > I hope someone can help fix it. > > > {code:java} > if (targetTypes.nonEmpty) { > // Add an extra Project if the targetTypes are different from the original > types. > children.map(widenTypes(_, targetTypes)) > } else { > // Unable to find a target type to widen, then just return the original set. 
> children > }{code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
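The cast in the extra Project above (decimal(20,2) operands widened to decimal(21,2)) follows the standard decimal-addition result-type rule. Below is a minimal Python sketch of that rule — an illustration only, not Spark's actual implementation, and the function name is made up:

```python
def widen_decimal_add(p1, s1, p2, s2):
    """Result (precision, scale) for adding decimal(p1,s1) and decimal(p2,s2),
    per the usual Hive/Spark rule:
      scale     = max(s1, s2)
      precision = max(p1 - s1, p2 - s2) + scale + 1
    The extra +1 digit accounts for a possible carry in the addition."""
    scale = max(s1, s2)
    precision = max(p1 - s1, p2 - s2) + scale + 1
    return precision, scale

# a + a with a: decimal(20,2) widens to decimal(21,2), matching the
# CheckOverflow(..., DecimalType(21,2)) node in the analyzed plan above.
print(widen_decimal_add(20, 2, 20, 2))
```

Because the UNION branches then disagree (decimal(20,2) vs decimal(21,2)), WidenSetOperationTypes wraps each branch in a widening Project, which is where the dangling reference arises.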
[jira] [Commented] (SPARK-32182) Getting Started - Quickstart
[ https://issues.apache.org/jira/browse/SPARK-32182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192103#comment-17192103 ] Apache Spark commented on SPARK-32182: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29679 > Getting Started - Quickstart > > > Key: SPARK-32182 > URL: https://issues.apache.org/jira/browse/SPARK-32182 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Example: > https://koalas.readthedocs.io/en/latest/getting_started/10min.html > https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html > https://pandas.pydata.org/docs/getting_started/10min.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark
Johnny Bai created SPARK-32821: -- Summary: cannot group by with window in sql sentence for structured streaming with watermark Key: SPARK-32821 URL: https://issues.apache.org/jira/browse/SPARK-32821 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Johnny Bai Fix For: 3.0.1 {{import spark.implicits._}} {{val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, word: String }}} {{// Group the data by window and word and compute the count of each group}} {{val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"),$"word").count()}} {{Grouping by a window is supported in the DataFrame API as above, but not in SQL style as below:}} {{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
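For reference, the DataFrame-side window($"timestamp", "10 minutes", "5 minutes") call assigns each row to every sliding window containing its timestamp. A minimal Python sketch of that assignment — illustrative only: it uses plain integers for timestamps and ignores windows that would start before time 0:

```python
def sliding_windows(ts, size, slide):
    """Return all half-open [start, start+size) windows, whose starts are
    multiples of `slide`, that contain timestamp `ts` (integer units)."""
    wins = []
    start = (ts // slide) * slide  # latest window start that can cover ts
    while start >= 0 and start + size > ts:
        wins.append((start, start + size))
        start -= slide
    return sorted(wins)

# A row at t=12 with 10-unit windows sliding by 5 lands in two windows.
print(sliding_windows(12, 10, 5))
```

With slide == size this degenerates to a single tumbling window per row, which is the grouping the SQL-style query in the issue is trying to express.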
[jira] [Commented] (SPARK-32204) Binder Integration
[ https://issues.apache.org/jira/browse/SPARK-32204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192100#comment-17192100 ] Apache Spark commented on SPARK-32204: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29679 > Binder Integration > -- > > Key: SPARK-32204 > URL: https://issues.apache.org/jira/browse/SPARK-32204 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > For example, > https://github.com/databricks/koalas > https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters
[ https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192093#comment-17192093 ] Apache Spark commented on SPARK-32815: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29678 > Fix LibSVM data source loading error on file paths with glob metacharacters > --- > > Key: SPARK-32815 > URL: https://issues.apache.org/jira/browse/SPARK-32815 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > SPARK-32810 fixed a long-standing bug in a few Spark built-in data sources > that fail to read files whose names contain glob metacharacters, such as [, > ], \{, }, etc. > The CSV and JSON data sources on the Spark side were affected. We've also noticed > that the LibSVM data source had the same code pattern that leads to the bug, > so the fix https://github.com/apache/spark/pull/29659 included a fix for that > data source as well, but it did not include a test for the LibSVM data source. > This ticket tracks adding a test case for LibSVM, similar to the ones for > CSV/JSON, to verify whether or not the fix works as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
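The underlying pitfall is generic: a literal path containing [, ], * or ? is misread as a glob pattern unless it is escaped first, so "file[1]" matches "file1" rather than the actual file. Python's standard library exhibits the same issue and its fix — this is an analogue for illustration, not the Hadoop/Spark glob API the data sources actually use:

```python
import glob

# "[1]" would be parsed as a character class matching the single
# character '1'; glob.escape neutralizes the metacharacters so the
# path is treated literally.
path = "data/file[1].libsvm"
escaped = glob.escape(path)
print(escaped)  # data/file[[]1].libsvm
```

The Spark fix follows the same idea: escape the path before handing it to the globbing file-listing code.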
[jira] [Updated] (SPARK-32811) Replace IN predicate of continuous range with boundary checks
[ https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vu Ho updated SPARK-32811: -- Priority: Minor (was: Major) > Replace IN predicate of continuous range with boundary checks > - > > Key: SPARK-32811 > URL: https://issues.apache.org/jira/browse/SPARK-32811 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Vu Ho >Priority: Minor > > This expression > {code:java} > select a from t where a in (1, 2, 3, 3, 4){code} > should be translated to > {code:java} > select a from t where a >= 1 and a <= 4 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
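The proposed rewrite only applies when the distinct IN-list values form a contiguous range, as in the example above. A minimal Python sketch of that check — illustrative only, with made-up names, not Spark's optimizer API:

```python
def in_to_range(values):
    """If the distinct integer `values` form a contiguous range, return
    (lo, hi) so `a IN (...)` can become `a >= lo AND a <= hi`;
    otherwise return None (the IN predicate must stay as-is)."""
    vs = sorted(set(values))  # duplicates like the second 3 are harmless
    if vs and vs == list(range(vs[0], vs[-1] + 1)):
        return vs[0], vs[-1]
    return None

print(in_to_range([1, 2, 3, 3, 4]))  # contiguous: becomes a >= 1 AND a <= 4
print(in_to_range([1, 3]))           # gap at 2: no rewrite
```

Replacing a long IN list with two comparisons avoids one equality test per list element at evaluation time.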
[jira] [Assigned] (SPARK-32820) Remove redundant shuffle exchanges inserted by EnsureRequirements
[ https://issues.apache.org/jira/browse/SPARK-32820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32820: Assignee: Kousuke Saruta (was: Apache Spark) > Remove redundant shuffle exchanges inserted by EnsureRequirements > - > > Key: SPARK-32820 > URL: https://issues.apache.org/jira/browse/SPARK-32820 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Redundant repartition operations are removed by the CollapseRepartition rule, but > EnsureRequirements can insert another HashPartitioning or RangePartitioning > immediately after the repartition, leading to adjacent ShuffleExchanges > in the physical plan. > > {code:java} > val ordered = spark.range(1, 100).repartitionByRange(10, > $"id".desc).orderBy($"id") > ordered.explain(true) > ... > == Physical Plan == > *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200), true, [id=#15] >+- Exchange rangepartitioning(id#0L DESC NULLS LAST, 10), false, [id=#14] > +- *(1) Range (1, 100, step=1, splits=12){code} > {code:java} > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 0) > val left = Seq(1,2,3).toDF.repartition(10, $"value") > val right = Seq(1,2,3).toDF > val joined = left.join(right, left("value") + 1 === right("value")) > joined.explain(true) > ... 
> == Physical Plan == > *(3) SortMergeJoin [(value#7 + 1)], [value#12], Inner > :- *(1) Sort [(value#7 + 1) ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning((value#7 + 1), 200), true, [id=#67] > : +- Exchange hashpartitioning(value#7, 10), false, [id=#63] > :+- LocalTableScan [value#7] > +- *(2) Sort [value#12 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(value#12, 200), true, [id=#68] > +- LocalTableScan [value#12]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
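The redundancy pattern in both plans above is an Exchange sitting directly on top of another Exchange: the outer shuffle fully redetermines the partitioning, so the inner one does no useful work. A toy Python sketch of collapsing that pattern over a plan represented as a top-to-bottom list of operator names — an illustration of the idea only, not the actual EnsureRequirements rule, which must also verify the outer exchange's output partitioning regardless of its child:

```python
def collapse_exchanges(plan):
    """Drop any Exchange whose immediate parent (the previous entry,
    since the list runs top to bottom) is also an Exchange."""
    out = []
    for op in plan:
        if out and out[-1].startswith("Exchange") and op.startswith("Exchange"):
            continue  # inner exchange: its output is immediately re-shuffled
        out.append(op)
    return out

plan = ["Sort", "Exchange rangepartitioning(id ASC, 200)",
        "Exchange rangepartitioning(id DESC, 10)", "Range"]
print(collapse_exchanges(plan))
```

Applied to the first example, this keeps only the 200-partition range exchange that the Sort requires, matching the intent of the ticket.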
[jira] [Assigned] (SPARK-32820) Remove redundant shuffle exchanges inserted by EnsureRequirements
[ https://issues.apache.org/jira/browse/SPARK-32820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32820: Assignee: Apache Spark (was: Kousuke Saruta) > Remove redundant shuffle exchanges inserted by EnsureRequirements > - > > Key: SPARK-32820 > URL: https://issues.apache.org/jira/browse/SPARK-32820 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > Redundant repartition operations are removed by the CollapseRepartition rule, but > EnsureRequirements can insert another HashPartitioning or RangePartitioning > immediately after the repartition, leading to adjacent ShuffleExchanges > in the physical plan. > > {code:java} > val ordered = spark.range(1, 100).repartitionByRange(10, > $"id".desc).orderBy($"id") > ordered.explain(true) > ... > == Physical Plan == > *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200), true, [id=#15] >+- Exchange rangepartitioning(id#0L DESC NULLS LAST, 10), false, [id=#14] > +- *(1) Range (1, 100, step=1, splits=12){code} > {code:java} > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 0) > val left = Seq(1,2,3).toDF.repartition(10, $"value") > val right = Seq(1,2,3).toDF > val joined = left.join(right, left("value") + 1 === right("value")) > joined.explain(true) > ... 
> == Physical Plan == > *(3) SortMergeJoin [(value#7 + 1)], [value#12], Inner > :- *(1) Sort [(value#7 + 1) ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning((value#7 + 1), 200), true, [id=#67] > : +- Exchange hashpartitioning(value#7, 10), false, [id=#63] > :+- LocalTableScan [value#7] > +- *(2) Sort [value#12 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(value#12, 200), true, [id=#68] > +- LocalTableScan [value#12]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32820) Remove redundant shuffle exchanges inserted by EnsureRequirements
[ https://issues.apache.org/jira/browse/SPARK-32820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192081#comment-17192081 ] Apache Spark commented on SPARK-32820: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29677 > Remove redundant shuffle exchanges inserted by EnsureRequirements > - > > Key: SPARK-32820 > URL: https://issues.apache.org/jira/browse/SPARK-32820 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Redundant repartition operations are removed by the CollapseRepartition rule, but > EnsureRequirements can insert another HashPartitioning or RangePartitioning > immediately after the repartition, leading to adjacent ShuffleExchanges > in the physical plan. > > {code:java} > val ordered = spark.range(1, 100).repartitionByRange(10, > $"id".desc).orderBy($"id") > ordered.explain(true) > ... > == Physical Plan == > *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200), true, [id=#15] >+- Exchange rangepartitioning(id#0L DESC NULLS LAST, 10), false, [id=#14] > +- *(1) Range (1, 100, step=1, splits=12){code} > {code:java} > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 0) > val left = Seq(1,2,3).toDF.repartition(10, $"value") > val right = Seq(1,2,3).toDF > val joined = left.join(right, left("value") + 1 === right("value")) > joined.explain(true) > ... 
> == Physical Plan == > *(3) SortMergeJoin [(value#7 + 1)], [value#12], Inner > :- *(1) Sort [(value#7 + 1) ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning((value#7 + 1), 200), true, [id=#67] > : +- Exchange hashpartitioning(value#7, 10), false, [id=#63] > :+- LocalTableScan [value#7] > +- *(2) Sort [value#12 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(value#12, 200), true, [id=#68] > +- LocalTableScan [value#12]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org