[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table
[ https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157180#comment-17157180 ] Cheng Su commented on SPARK-24528: -- +1 for [~viirya]'s suggestion. I think some change in FileScanRDD and FileSourceScanExec should do the job to preserve the ordering property when reading sorted bucketed files (for the non-vectorized code path). Though we should enable this feature selectively: each task has to keep the current row of every bucket file it reads in task memory, so we need to be careful not to merge too many bucket files and cause an OOM in the task. I am working on a PR now. > Missing optimization for Aggregations/Windowing on a bucketed table > --- > > Key: SPARK-24528 > URL: https://issues.apache.org/jira/browse/SPARK-24528 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ohad Raviv >Priority: Major > > Closely related to > SPARK-24410, we're trying to optimize a very common use case we have of > getting the most updated row by id from a fact table. > We're saving the table bucketed to skip the shuffle stage, but we still > "waste" time on the Sort operator even though the data is already sorted. 
> here's a good example: > {code:java} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("key", "t1") > .saveAsTable("a1"){code} > {code:java} > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, > key#24L, t1, t1#25L, t2, t2#26L))]) > +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, > t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))]) > +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, > Format: Parquet, Location: ...{code} > > and here's a bad example, but more realistic: > {code:java} > sparkSession.sql("set spark.sql.shuffle.partitions=2") > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, > key#32L, t1, t1#33L, t2, t2#34L))]) > +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, > t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))]) > +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0 > +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, > Format: Parquet, Location: ... > {code} > > I've traced the problem to DataSourceScanExec#235: > {code:java} > val sortOrder = if (sortColumns.nonEmpty) { > // In case of bucketing, its possible to have multiple files belonging to > the > // same bucket in a given relation. Each of these files are locally sorted > // but those files combined together are not globally sorted. 
Given that, > // the RDD partition will not be sorted even if the relation has sort > columns set > // Current solution is to check if all the buckets have a single file in it > val files = selectedPartitions.flatMap(partition => partition.files) > val bucketToFilesGrouping = > files.map(_.getPath.getName).groupBy(file => > BucketingUtils.getBucketId(file)) > val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= > 1){code} > so obviously the code avoids dealing with this situation for now. Could you think of a way to solve this or bypass it? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
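Not the actual FileScanRDD change, but the merge Cheng Su describes — combining several locally sorted bucket files into one globally sorted stream while holding only the current head row of each file in memory — can be sketched as a heap-based k-way merge. This is an illustrative Python toy with made-up row tuples, not Spark code:

```python
import heapq

def merge_sorted_bucket_files(file_iterators):
    """Lazily merge already-sorted row iterators into one sorted stream.
    Only the current head row of each iterator is resident in memory at a
    time -- which is also why merging too many files at once risks an OOM."""
    return heapq.merge(*file_iterators)

# Three "files" of the same bucket, each locally sorted by key:
f1 = iter([(1, "a"), (4, "d")])
f2 = iter([(2, "b"), (5, "e")])
f3 = iter([(3, "c")])
merged = list(merge_sorted_bucket_files([f1, f2, f3]))
# merged is globally sorted: [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

Memory stays proportional to the number of files being merged (one row per file plus the heap), which is the trade-off behind enabling this only selectively.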
[jira] [Commented] (SPARK-32298) tree models prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157144#comment-17157144 ] Apache Spark commented on SPARK-32298: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/29095 > tree models prediction optimization > --- > > Key: SPARK-32298 > URL: https://issues.apache.org/jira/browse/SPARK-32298 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > In {{Node}}'s method > > {{def predictImpl(features: Vector): LeafNode}} > > use a while loop instead of recursion.
[jira] [Assigned] (SPARK-32298) tree models prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32298: Assignee: Apache Spark > tree models prediction optimization > --- > > Key: SPARK-32298 > URL: https://issues.apache.org/jira/browse/SPARK-32298 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > In {{Node}}'s method > > {{def predictImpl(features: Vector): LeafNode}} > > use a while loop instead of recursion.
[jira] [Assigned] (SPARK-32298) tree models prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32298: Assignee: (was: Apache Spark) > tree models prediction optimization > --- > > Key: SPARK-32298 > URL: https://issues.apache.org/jira/browse/SPARK-32298 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > In {{Node}}'s method > > {{def predictImpl(features: Vector): LeafNode}} > > use a while loop instead of recursion.
[jira] [Created] (SPARK-32298) tree models prediction optimization
zhengruifeng created SPARK-32298: Summary: tree models prediction optimization Key: SPARK-32298 URL: https://issues.apache.org/jira/browse/SPARK-32298 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng In {{Node}}'s method {{def predictImpl(features: Vector): LeafNode}}, use a while loop instead of recursion.
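A minimal sketch of the proposed change — iterative descent to a leaf instead of recursive calls. The node classes here are hypothetical Python stand-ins, not Spark ML's actual {{Node}} hierarchy:

```python
class LeafNode:
    """Terminal node holding a prediction."""
    def __init__(self, prediction):
        self.prediction = prediction

class InternalNode:
    """Split node: go left when features[feature] <= threshold."""
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

def predict_impl(node, features):
    # While-loop descent: constant call-stack depth regardless of tree
    # height, unlike the recursive version it replaces.
    while not isinstance(node, LeafNode):
        node = node.left if features[node.feature] <= node.threshold else node.right
    return node

tree = InternalNode(0, 0.5, LeafNode("left leaf"), LeafNode("right leaf"))
```

Besides avoiding stack growth on deep trees, the loop form typically avoids per-call overhead on the hot prediction path.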
[jira] [Assigned] (SPARK-32241) Remove empty children of union
[ https://issues.apache.org/jira/browse/SPARK-32241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32241: --- Assignee: Peter Toth > Remove empty children of union > -- > > Key: SPARK-32241 > URL: https://issues.apache.org/jira/browse/SPARK-32241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Minor > > Empty relation children of a union can be removed. > e.g. the plan of > {noformat} > SELECT c FROM t UNION ALL SELECT c FROM t WHERE FALSE{noformat} > is currently: > {noformat} > == Physical Plan == > Union > :- *(1) Project [value#219 AS c#222] > : +- *(1) LocalTableScan [value#219] > +- LocalTableScan , [c#224]{noformat} > but it could be improved as: > {noformat} > == Physical Plan == > *(1) Project [value#219 AS c#222] > +- *(1) LocalTableScan [value#219]{noformat}
[jira] [Resolved] (SPARK-32241) Remove empty children of union
[ https://issues.apache.org/jira/browse/SPARK-32241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32241. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29053 [https://github.com/apache/spark/pull/29053] > Remove empty children of union > -- > > Key: SPARK-32241 > URL: https://issues.apache.org/jira/browse/SPARK-32241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Minor > Fix For: 3.1.0 > > > Empty relation children of a union can be removed. > e.g. the plan of > {noformat} > SELECT c FROM t UNION ALL SELECT c FROM t WHERE FALSE{noformat} > is currently: > {noformat} > == Physical Plan == > Union > :- *(1) Project [value#219 AS c#222] > : +- *(1) LocalTableScan [value#219] > +- LocalTableScan , [c#224]{noformat} > but it could be improved as: > {noformat} > == Physical Plan == > *(1) Project [value#219 AS c#222] > +- *(1) LocalTableScan [value#219]{noformat}
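The rule itself is easy to sketch outside of Catalyst. A toy version (Python, with plain lists standing in for relations — not the actual optimizer rule) that drops empty children and collapses a single-child union:

```python
def prune_union(children):
    """Drop empty-relation children from a Union; collapse the Union node
    entirely when zero or one children remain."""
    non_empty = [c for c in children if c]
    if not non_empty:
        return []                # the whole union is an empty relation
    if len(non_empty) == 1:
        return non_empty[0]      # no Union node needed anymore
    return ("Union", non_empty)

# Second child is the `... WHERE FALSE` branch, i.e. an empty relation:
plan = prune_union([[("a", 1)], []])
```

After pruning, only the non-empty child's scan remains, matching the improved physical plan shown above.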
[jira] [Commented] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157132#comment-17157132 ] Apache Spark commented on SPARK-24983: -- User 'constzhou' has created a pull request for this issue: https://github.com/apache/spark/pull/29094 > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> 
dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see, the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit.
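The blow-up is easy to reproduce outside Spark: each collapse substitutes the previous projection's expression for every reference to the column, and a dependent when/otherwise references the column twice (in the condition and in the else branch), so the expression text at least doubles per step. A sketch using hypothetical string expressions, not Catalyst's actual expression trees:

```python
def collapse(expr, i):
    """Inline the previous projection's expression into the next
    when/otherwise, as CollapseProject does. The column is referenced
    twice (condition and else-branch), so the text at least doubles."""
    return f"CASE WHEN ({expr} = {i}) THEN 'r{i}' ELSE {expr} END"

expr = "text"
sizes = []
for i in range(1, 6):        # five chained when/otherwise steps
    expr = collapse(expr, i)
    sizes.append(len(expr))
# sizes grows geometrically: each step more than doubles the expression
```

Five steps already yield an expression over a thousand characters, which is why a size limit before collapsing (as suggested above) would bound the damage.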
[jira] [Commented] (SPARK-32266) Run smoke tests after a commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157129#comment-17157129 ] Gengliang Wang commented on SPARK-32266: [~hyukjin.kwon] Thanks for the update. > Run smoke tests after a commit is pushed > > > Key: SPARK-32266 > URL: https://issues.apache.org/jira/browse/SPARK-32266 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Run linter/sbt build/maven build/doc generation when a commit is pushed.
[jira] [Commented] (SPARK-31356) Splitting Aggregate node into separate Aggregate and Serialize for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-31356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157128#comment-17157128 ] Martin Loncaric commented on SPARK-31356: - Actually, there seem to be 3 separate performance issues: 1. unnecessary appendColumns when the groupByKey function just returns a subset of columns (though this is hard to get around in a type-safe way) 2. unnecessary serialize + deserialize 3. the RDD API is actually roughly 2x faster overall. It seems there's a lot of room to improve aggregations > Splitting Aggregate node into separate Aggregate and Serialize for Optimizer > > > Key: SPARK-31356 > URL: https://issues.apache.org/jira/browse/SPARK-31356 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Martin Loncaric >Priority: Major > > Problem: in the Datasets API, it is a very common pattern to do something like > this whenever a complex reduce function is needed: > {code:scala} > ds > .groupByKey(_.y) > .reduceGroups((a, b) => {...}) > .map(_._2) > {code} > However, the .map(_._2) step (taking values and throwing keys away) > unfortunately often ends up as an unnecessary serialization during the > aggregation step, followed by {{DeserializeToObject + MapElements (from (K, > V) => V) + SerializeFromObject}} in the optimized logical plan. In this > example, it would be more ideal to either skip the > deserialization/serialization or {{Project (from (K, V) => V)}}. 
Even > manually doing a {{.select(...).as[T]}} to replace the `.map` is quite > tricky, because > * the columns are complicated, like {{[value, > ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}} > * it breaks the nice type checking of Datasets > Proposal: > Change the {{KeyValueGroupedDataset.aggUntyped}} method to (like > {{KeyValueGroupedDataset.cogroup}}) append both an {{Aggregate}} node and > a {{SerializeFromObject}} node so that the Optimizer can eliminate the > serialization when it is redundant. Change aggregations to emit deserialized > results. > I had 2 ideas for what we could change: either add a new feature, > {{.reduceGroupValues}}, that projects to only the necessary columns, or do > this improvement. I thought this would be a better solution because > * it will improve the performance of existing Spark applications with no > modifications > * feature growth is undesirable > Uncertainties: > Affects Version: I'm not sure - if I submit a PR soon, can we get this into > 3.0? Or only 3.1? And I assume we're not adding new features to 2.4? > Complications: Are there any hazards in splitting Aggregation into > Aggregation + SerializeFromObject that I'm not aware of?
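The intended end state — reduce per key and emit only the values, with no intermediate (key, value) pair to deserialize and map over — can be sketched in plain Python. `reduce_groups_values` is a hypothetical helper for illustration, not the proposed Spark API:

```python
def reduce_groups_values(rows, key_fn, reduce_fn):
    """Group rows by key, reduce each group, and return only the reduced
    values -- the net effect of .groupByKey(...).reduceGroups(...).map(_._2),
    without building (key, value) pairs that are immediately thrown away."""
    acc = {}
    for row in rows:
        k = key_fn(row)
        acc[k] = row if k not in acc else reduce_fn(acc[k], row)
    return list(acc.values())

rows = [("a", 1), ("b", 5), ("a", 3)]
# Keep the "most updated" row per key (largest second field):
latest = reduce_groups_values(rows, lambda r: r[0],
                              lambda x, y: x if x[1] >= y[1] else y)
```

The point of the proposal is that the optimizer should get the plan into this shape automatically, rather than requiring a new user-facing method.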
[jira] [Comment Edited] (SPARK-32253) Make readability better in the test result logs
[ https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157120#comment-17157120 ] L. C. Hsieh edited comment on SPARK-32253 at 7/14/20, 3:19 AM: --- Looks interesting. Will do some tests. :) was (Author: viirya): Will do some tests. :) > Make readability better in the test result logs > --- > > Key: SPARK-32253 > URL: https://issues.apache.org/jira/browse/SPARK-32253 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, the readability of the logs is not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > We should have a way to easily see the failed test cases.
[jira] [Comment Edited] (SPARK-32253) Make readability better in the test result logs
[ https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157120#comment-17157120 ] L. C. Hsieh edited comment on SPARK-32253 at 7/14/20, 3:18 AM: --- Will do some tests. :) was (Author: viirya): Will do some tests. > Make readability better in the test result logs > --- > > Key: SPARK-32253 > URL: https://issues.apache.org/jira/browse/SPARK-32253 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, the readability of the logs is not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > We should have a way to easily see the failed test cases.
[jira] [Commented] (SPARK-32253) Make readability better in the test result logs
[ https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157120#comment-17157120 ] L. C. Hsieh commented on SPARK-32253: - Will do some tests. > Make readability better in the test result logs > --- > > Key: SPARK-32253 > URL: https://issues.apache.org/jira/browse/SPARK-32253 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, the readability of the logs is not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > We should have a way to easily see the failed test cases.
[jira] [Commented] (SPARK-32264) More resources in Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157100#comment-17157100 ] Hyukjin Kwon commented on SPARK-32264: -- This is in progress on the private mailing list. > More resources in Github Actions > > > Key: SPARK-32264 > URL: https://issues.apache.org/jira/browse/SPARK-32264 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We are currently using the free version of Github Actions, which only allows 20 > concurrent jobs. This is not enough given the heavy development activity in Apache Spark. > We should have a way to allocate more resources.
[jira] [Resolved] (SPARK-32266) Run smoke tests after a commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32266. -- Assignee: Dongjoon Hyun Resolution: Fixed > Run smoke tests after a commit is pushed > > > Key: SPARK-32266 > URL: https://issues.apache.org/jira/browse/SPARK-32266 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Dongjoon Hyun >Priority: Major > > Run linter/sbt build/maven build/doc generation when a commit is pushed.
[jira] [Updated] (SPARK-32266) Run smoke tests after a commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32266: - Fix Version/s: 3.1.0 > Run smoke tests after a commit is pushed > > > Key: SPARK-32266 > URL: https://issues.apache.org/jira/browse/SPARK-32266 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Run linter/sbt build/maven build/doc generation when a commit is pushed.
[jira] [Commented] (SPARK-32266) Run smoke tests after a commit is pushed
[ https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157098#comment-17157098 ] Hyukjin Kwon commented on SPARK-32266: -- This was fixed in https://github.com/apache/spark/pull/29076 > Run smoke tests after a commit is pushed > > > Key: SPARK-32266 > URL: https://issues.apache.org/jira/browse/SPARK-32266 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Priority: Major > > Run linter/sbt build/maven build/doc generation when a commit is pushed.
[jira] [Commented] (SPARK-32253) Make readability better in the test result logs
[ https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157097#comment-17157097 ] Hyukjin Kwon commented on SPARK-32253: -- [~Gengliang.Wang] or probably [~viirya] from the watchers :-). Are you guys interested in this? Testing is pretty easy: make a branch as usual, but open a PR against the master of your own fork to test it. That will automatically trigger the Github Actions build, using your account, in your forked Spark repo. > Make readability better in the test result logs > --- > > Key: SPARK-32253 > URL: https://issues.apache.org/jira/browse/SPARK-32253 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, the readability of the logs is not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > We should have a way to easily see the failed test cases.
[jira] [Commented] (SPARK-32296) Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-32296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157086#comment-17157086 ] Hyukjin Kwon commented on SPARK-32296: -- cc [~jiangxb1987] FYI > Flaky Test: submit a barrier ResultStage that requires more slots than > current total under local-cluster mode > - > > Key: SPARK-32296 > URL: https://issues.apache.org/jira/browse/SPARK-32296 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code}
> 2020-07-13T21:39:28.3795362Z [info] - submit a barrier ResultStage that requires more slots than current total under local-cluster mode *** FAILED *** (5 seconds, 703 milliseconds)
> 2020-07-13T21:39:28.3843780Z [info]   Expected exception org.apache.spark.SparkException to be thrown, but java.util.concurrent.TimeoutException was thrown (BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.3844344Z [info]   org.scalatest.exceptions.TestFailedException:
> 2020-07-13T21:39:28.4058689Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> 2020-07-13T21:39:28.4059209Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> 2020-07-13T21:39:28.4175876Z [info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4176563Z [info]   at org.scalatest.Assertions.intercept(Assertions.scala:814)
> 2020-07-13T21:39:28.4176967Z [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:804)
> 2020-07-13T21:39:28.4177353Z [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4177794Z [info]   at org.apache.spark.BarrierStageOnSubmittedSuite.testSubmitJob(BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.4178272Z [info]   at org.apache.spark.BarrierStageOnSubmittedSuite.$anonfun$new$35(BarrierStageOnSubmittedSuite.scala:240)
> 2020-07-13T21:39:28.4178695Z [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2020-07-13T21:39:28.4179081Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 2020-07-13T21:39:28.4179731Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 2020-07-13T21:39:28.4180162Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 2020-07-13T21:39:28.4180550Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> 2020-07-13T21:39:28.4180929Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> 2020-07-13T21:39:28.4181323Z [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 2020-07-13T21:39:28.4181728Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> 2020-07-13T21:39:28.4223205Z [info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> 2020-07-13T21:39:28.4223689Z [info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224119Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> 2020-07-13T21:39:28.4224510Z [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224901Z [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> 2020-07-13T21:39:28.4225362Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4225778Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> 2020-07-13T21:39:28.4226188Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> 2020-07-13T21:39:28.4226589Z [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4226997Z
[jira] [Created] (SPARK-32297) Flaky Test: YarnClusterSuite 4 test cases
Hyukjin Kwon created SPARK-32297: Summary: Flaky Test: YarnClusterSuite 4 test cases Key: SPARK-32297 URL: https://issues.apache.org/jira/browse/SPARK-32297 Project: Spark Issue Type: Sub-task Components: Tests, YARN Affects Versions: 3.1.0 Reporter: Hyukjin Kwon
{code}
2020-07-13T20:04:30.9911637Z [info] - run Spark in yarn-client mode with different configurations, ensuring redaction *** FAILED *** (3 minutes, 0 seconds)
2020-07-13T20:04:30.9912398Z [info]   The code passed to eventually never returned normally. Attempted 190 times over 3.001191441868 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:170)
2020-07-13T20:04:30.9931230Z [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
2020-07-13T20:04:30.9932756Z [info]   at org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
2020-07-13T20:04:30.9933210Z [info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
2020-07-13T20:04:30.9933633Z [info]   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
2020-07-13T20:04:30.9934024Z [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
2020-07-13T20:04:30.9934430Z [info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:308)
2020-07-13T20:04:30.9934824Z [info]   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:307)
2020-07-13T20:04:30.9935218Z [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
2020-07-13T20:04:30.9935655Z [info]   at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.runSpark(BaseYarnClusterSuite.scala:170)
2020-07-13T20:04:31.0012081Z [info]   at org.apache.spark.deploy.yarn.YarnClusterSuite.testBasicYarnApp(YarnClusterSuite.scala:243)
2020-07-13T20:04:31.0013838Z [info]   at org.apache.spark.deploy.yarn.YarnClusterSuite.$anonfun$new$4(YarnClusterSuite.scala:104)
2020-07-13T20:04:31.0015078Z [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2020-07-13T20:04:31.0015899Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2020-07-13T20:04:31.0016423Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2020-07-13T20:04:31.0016952Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2020-07-13T20:04:31.0017479Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
2020-07-13T20:04:31.0018599Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
2020-07-13T20:04:31.0019144Z [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
2020-07-13T20:04:31.0019692Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
2020-07-13T20:04:31.0020230Z [info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
2020-07-13T20:04:31.0020789Z [info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
2020-07-13T20:04:31.0021285Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
2020-07-13T20:04:31.0021826Z [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
2020-07-13T20:04:31.0022361Z [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
2020-07-13T20:04:31.0022913Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
2020-07-13T20:04:31.0023470Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
2020-07-13T20:04:31.0024015Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
2020-07-13T20:04:31.0024534Z [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
2020-07-13T20:04:31.0025078Z [info]   at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
2020-07-13T20:04:31.0025606Z [info]   at org.scalatest.SuperEng
[jira] [Resolved] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32138. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28957 [https://github.com/apache/spark/pull/28957] > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32138: Assignee: Hyukjin Kwon > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy
[ https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157080#comment-17157080 ] Hyukjin Kwon commented on SPARK-32278: -- Oh, yeah. I noticed this and forgot to take action on this JIRA. Sorry for the false alarm - this JIRA can be resolved.
> Install PyPy3 on Jenkins to enable PySpark tests with PyPy
> --
>
> Key: SPARK-32278
> URL: https://issues.apache.org/jira/browse/SPARK-32278
> Project: Spark
> Issue Type: Test
> Components: Project Infra, PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Assignee: Shane Knapp
> Priority: Major
>
> The PyPy currently installed on Jenkins is too old and only Python 2 compatible. Python 2 will be dropped in SPARK-32138, so we should upgrade to the Python 3 compatible PyPy3.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160
[jira] [Resolved] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy
[ https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32278. -- Resolution: Not A Problem > Install PyPy3 on Jenkins to enable PySpark tests with PyPy > -- > > Key: SPARK-32278 > URL: https://issues.apache.org/jira/browse/SPARK-32278 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Shane Knapp >Priority: Major > > Current PyPy installed in Jenkins is too old, which is Python 2 compatible. > Python 2 will be dropped at SPARK-32138, and we should now upgrade PyPy to > Python 3 compatible PyPy 3. > See also: > https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160 > https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32279) Install Sphinx in Python 3 on Jenkins machines
[ https://issues.apache.org/jira/browse/SPARK-32279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157078#comment-17157078 ] Hyukjin Kwon commented on SPARK-32279: -- I believe any version is fine. Probably the latest one :-). > Install Sphinx in Python 3 on Jenkins machines > -- > > Key: SPARK-32279 > URL: https://issues.apache.org/jira/browse/SPARK-32279 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Shane Knapp >Priority: Major > > Currently Sphinx is only installed in Python 2. We should install it in > Python 3 and test it in Jenkins as Python 2, 3.4 and 3.5 were dropped at > SPARK-32138. > See also: > https://github.com/apache/spark/pull/28957/files#diff-ccd847a0316575dde31bd89786bbe1f2R176 > https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/dev/lint-python#L176 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32146) ValueError when loading a PipelineModel on a personal computer
[ https://issues.apache.org/jira/browse/SPARK-32146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32146. -- Resolution: Invalid Please use the user mailing list for questions. If your issue is specific to a vendor, please go through that vendor's support channel.
> ValueError when loading a PipelineModel on a personal computer
> --
>
> Key: SPARK-32146
> URL: https://issues.apache.org/jira/browse/SPARK-32146
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.5
> Environment: * OS: Windows
> * SparkSession: spark = SparkSession.builder.appName({color:#6a8759}"annonces_organiques"{color}).getOrCreate()
> Reporter: LoicH
> Priority: Major
>
> I have a PipelineModel saved on my computer that I can't load using {{PipelineModel.load(path)}}.
> When I launch my code in a Databricks cluster, it works. {{path}} is the path to my model saved on DBFS, accessible via a mount point: {{path = "/dbfs/path/to/my/model"}}.
> However on my machine, calling {{PipelineModel.load("C:\\Users\\path\\to\\my\\model")}} throws a {{ValueError("RDD is empty")}}.
> Here is how the model is saved on my computer:
> {code:title=pipeline.txt}
> \---model
>     +---metadata
>     |       part-0
>     |       _SUCCESS
>     |
>     \---stages
>         +---0_CountVectorizer_b92625354bf7
>         |   +---data
>         |   |       part-0-tid-9156766819779394023-5cf6aecb-8959-48b3-be24-65bfa0543465-62-1-c000.snappy.parquet
>         |   |       _committed_9156766819779394023
>         |   |       _started_9156766819779394023
>         |   |       _SUCCESS
>         |   |
>         |   \---metadata
>         |           part-0
>         |           _SUCCESS
>         |
>         \---1_LinearSVC_108fa01daf43
>             +---data
>             |       part-0-tid-4403060754466700849-27841dd9-de88-4015-9dfa-7854c2a15f15-65-1-c000.snappy.parquet
>             |       _committed_4403060754466700849
>             |       _started_4403060754466700849
>             |       _SUCCESS
>             |
>             \---metadata
>                     part-0
>                     _SUCCESS
> {code}
> (I just downloaded the model from my DataLake to my computer)
> How can I load this model when running my code locally?
[jira] [Updated] (SPARK-32146) ValueError when loading a PipelineModel on a personal computer
[ https://issues.apache.org/jira/browse/SPARK-32146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32146: - Priority: Major (was: Blocker) > ValueError when loading a PipelineModel on a personal computer > -- > > Key: SPARK-32146 > URL: https://issues.apache.org/jira/browse/SPARK-32146 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.5 > Environment: * OS: Windows > * SparkSession: spark = > SparkSession.builder.appName({color:#6a8759}"annonces_organiques"{color}).getOrCreate() >Reporter: LoicH >Priority: Major > > I have a PipelineModel saved on my computer that I can't load using > {{PipelineModel.load(path)}}. > When I launch my code in a Databricks cluster, it works. {{path}} is the path > to my model saved on DBFS, accessible via a mount point: {{path = > "/dbfs/path/to/my/model}}. > However on my machine, calling > {{PipelineModel.load("C:\\Users\\path\\to\\my\\model")}} throws a > {{ValueError("RDD is empty")}}. > Here is how the model is saved on my computer: > {code:title=pipeline.txt} > \---model > +---metadata > | part-0 > | _SUCCESS > | > \---stages > +---0_CountVectorizer_b92625354bf7 > | +---data > | | > part-0-tid-9156766819779394023-5cf6aecb-8959-48b3-be24-65bfa0543465-62-1-c000.snappy.parquet > | | _committed_9156766819779394023 > | | _started_9156766819779394023 > | | _SUCCESS > | | > | \---metadata > | part-0 > | _SUCCESS > | > \---1_LinearSVC_108fa01daf43 > +---data > | > part-0-tid-4403060754466700849-27841dd9-de88-4015-9dfa-7854c2a15f15-65-1-c000.snappy.parquet > | _committed_4403060754466700849 > | _started_4403060754466700849 > | _SUCCESS > | > \---metadata > part-0 > _SUCCESS > {code} > (I just downloaded the model from my DataLake to my computer) > How can I load this model when running my code in local? 
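One thing worth checking for the Windows failure quoted above is the form of the path handed to {{PipelineModel.load}}: Spark path arguments are URI-like, and a raw backslash Windows path is an easy source of trouble. The sketch below (stdlib only, a hypothetical illustration and not a confirmed fix for this issue) shows converting a Windows path to a {{file://}} URI before passing it to the loader.

```python
from pathlib import PureWindowsPath

def to_file_uri(win_path: str) -> str:
    """Convert an absolute Windows path to a file:// URI that URI-aware
    loaders can consume. Purely illustrative for this thread."""
    return PureWindowsPath(win_path).as_uri()

uri = to_file_uri(r"C:\Users\me\model")
# uri == "file:///C:/Users/me/model"
```

Whether the resulting URI resolves the {{ValueError("RDD is empty")}} reported here is not established; local Hadoop/winutils setup on Windows is another frequent culprit.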
[jira] [Commented] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
[ https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157069#comment-17157069 ] Jungtaek Lim commented on SPARK-32259: -- Lowering the priority, as Critical+ requires committer's judgement. > tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s > --- > > Key: SPARK-32259 > URL: https://issues.apache.org/jira/browse/SPARK-32259 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prakash Rajendran >Priority: Major > Attachments: Capture.PNG > > > In Spark-Submit, I have these config > "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still spark > is not pointing its spill data to SPARK_LOCAL_DIRS path. > K8s is evicting the pod due to error "{color:#de350b}*Pod ephemeral local > storage usage exceeds the total limit of containers.*{color}" > > We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod > logs for stack trace is not available. we have only pod events given in > attachment > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED
[ https://issues.apache.org/jira/browse/SPARK-32197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157070#comment-17157070 ] Jungtaek Lim commented on SPARK-32197: -- Lowering the priority, as Critical+ requires committer's judgement. > 'Spark driver' stays running even though 'spark application' has FAILED > --- > > Key: SPARK-32197 > URL: https://issues.apache.org/jira/browse/SPARK-32197 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > Attachments: app_executors.png, applog.txt, driverlog.txt, > failed1.png, failed_stages.png, failedapp.png, j1.out, stuckdriver.png > > > App failed in 6 minutes, driver has been stuck for > 8 hours. I would expect > driver to fail if app fails. > > Thread dump from jstack (on the driver pid) attached (j1.out) > Last part of stdout driver log attached (full log is 23MB, stderr log just > has launch command) > Last part of app logs attached > > Can see that "org.apache.spark.util.ShutdownHookManager - Shutdown hook > called" line never appears in the driver log after > "org.apache.spark.SparkContext - Successfully stopped SparkContext" > > Using spark 2.4.6 with spark standalone mode. spark-submit to REST API (port > 6066) in cluster mode was used. Other drivers/apps have worked fine with this > setup, just this one getting stuck. My cluster has 1 EC2 dedicated as spark > master and 1 Spot EC2 dedicated as spark worker. They can auto heal/spot > terminate at any time. From checking aws logs: the worker was terminated at > 01:53:38 > > I think you can replicate this by tearing down worker machine while app is > running. You might have to try several times. > > Similar to https://issues.apache.org/jira/browse/SPARK-24617 i raised before! 
[jira] [Updated] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED
[ https://issues.apache.org/jira/browse/SPARK-32197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32197: - Priority: Major (was: Blocker) > 'Spark driver' stays running even though 'spark application' has FAILED > --- > > Key: SPARK-32197 > URL: https://issues.apache.org/jira/browse/SPARK-32197 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > Attachments: app_executors.png, applog.txt, driverlog.txt, > failed1.png, failed_stages.png, failedapp.png, j1.out, stuckdriver.png > > > App failed in 6 minutes, driver has been stuck for > 8 hours. I would expect > driver to fail if app fails. > > Thread dump from jstack (on the driver pid) attached (j1.out) > Last part of stdout driver log attached (full log is 23MB, stderr log just > has launch command) > Last part of app logs attached > > Can see that "org.apache.spark.util.ShutdownHookManager - Shutdown hook > called" line never appears in the driver log after > "org.apache.spark.SparkContext - Successfully stopped SparkContext" > > Using spark 2.4.6 with spark standalone mode. spark-submit to REST API (port > 6066) in cluster mode was used. Other drivers/apps have worked fine with this > setup, just this one getting stuck. My cluster has 1 EC2 dedicated as spark > master and 1 Spot EC2 dedicated as spark worker. They can auto heal/spot > terminate at any time. From checking aws logs: the worker was terminated at > 01:53:38 > > I think you can replicate this by tearing down worker machine while app is > running. You might have to try several times. > > Similar to https://issues.apache.org/jira/browse/SPARK-24617 i raised before! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
[ https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32259: - Priority: Major (was: Blocker) > tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s > --- > > Key: SPARK-32259 > URL: https://issues.apache.org/jira/browse/SPARK-32259 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prakash Rajendran >Priority: Major > Attachments: Capture.PNG > > > In Spark-Submit, I have these config > "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still spark > is not pointing its spill data to SPARK_LOCAL_DIRS path. > K8s is evicting the pod due to error "{color:#de350b}*Pod ephemeral local > storage usage exceeds the total limit of containers.*{color}" > > We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod > logs for stack trace is not available. we have only pod events given in > attachment > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32220) Cartesian Product Hint cause data error
[ https://issues.apache.org/jira/browse/SPARK-32220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157066#comment-17157066 ] Apache Spark commented on SPARK-32220: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29093 > Cartesian Product Hint cause data error > --- > > Key: SPARK-32220 > URL: https://issues.apache.org/jira/browse/SPARK-32220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Blocker > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > {code:java} > spark-sql> select * from test4 order by a asc; > 1 2 > Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO > SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s) > spark-sql>select * from test5 order by a asc > 1 2 > 2 2 > Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO > SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar > spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 > where test4.a = test5.a order by test4.a asc ; > 1 2 1 2 > 1 2 2 2 > Time taken: 0.351 seconds, Fetched 2 row(s) > 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched > 2 row(s){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32294) GroupedData Pandas UDF 2Gb limit
[ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157065#comment-17157065 ] Hyukjin Kwon commented on SPARK-32294: -- Thanks for filing the issue, [~Tagar]. > GroupedData Pandas UDF 2Gb limit > > > Key: SPARK-32294 > URL: https://issues.apache.org/jira/browse/SPARK-32294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for > GroupedData, the whole group is passed to Pandas UDF at once, which can cause > various 2Gb limitations on Arrow side (and in current versions of Arrow, also > 2Gb limitation on Netty allocator side) - > https://issues.apache.org/jira/browse/ARROW-4890 > Would be great to consider feeding GroupedData into a pandas UDF in batches > to solve this issue. > cc [~hyukjin.kwon] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
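The batching suggested in the report above can be sketched in plain Python: splitting one group's rows into chunks no larger than a {{maxRecordsPerBatch}}-style limit before each chunk is handed to the UDF. This is a hypothetical stdlib illustration of the idea, not Spark's actual Arrow serialization path.

```python
def batch_group(rows, max_records_per_batch):
    """Yield successive slices of at most max_records_per_batch rows,
    so no single batch handed to a UDF exceeds the limit."""
    for start in range(0, len(rows), max_records_per_batch):
        yield rows[start:start + max_records_per_batch]

group = list(range(10))
batches = list(batch_group(group, 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In the reported behavior, the whole group is effectively one batch regardless of the configured limit, which is what allows a large group to hit the 2Gb Arrow/Netty ceilings.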
[jira] [Updated] (SPARK-32296) Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-32296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32296: - Component/s: Spark Core
> Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode
> -
>
> Key: SPARK-32296
> URL: https://issues.apache.org/jira/browse/SPARK-32296
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, Tests
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> {code}
> 2020-07-13T21:39:28.3795362Z [info] - submit a barrier ResultStage that requires more slots than current total under local-cluster mode *** FAILED *** (5 seconds, 703 milliseconds)
> 2020-07-13T21:39:28.3843780Z [info]   Expected exception org.apache.spark.SparkException to be thrown, but java.util.concurrent.TimeoutException was thrown (BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.3844344Z [info]   org.scalatest.exceptions.TestFailedException:
> 2020-07-13T21:39:28.4058689Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> 2020-07-13T21:39:28.4059209Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> 2020-07-13T21:39:28.4175876Z [info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4176563Z [info]   at org.scalatest.Assertions.intercept(Assertions.scala:814)
> 2020-07-13T21:39:28.4176967Z [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:804)
> 2020-07-13T21:39:28.4177353Z [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4177794Z [info]   at org.apache.spark.BarrierStageOnSubmittedSuite.testSubmitJob(BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.4178272Z [info]   at org.apache.spark.BarrierStageOnSubmittedSuite.$anonfun$new$35(BarrierStageOnSubmittedSuite.scala:240)
> 2020-07-13T21:39:28.4178695Z [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2020-07-13T21:39:28.4179081Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 2020-07-13T21:39:28.4179731Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 2020-07-13T21:39:28.4180162Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 2020-07-13T21:39:28.4180550Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> 2020-07-13T21:39:28.4180929Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> 2020-07-13T21:39:28.4181323Z [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 2020-07-13T21:39:28.4181728Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> 2020-07-13T21:39:28.4223205Z [info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> 2020-07-13T21:39:28.4223689Z [info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224119Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> 2020-07-13T21:39:28.4224510Z [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224901Z [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> 2020-07-13T21:39:28.4225362Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4225778Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> 2020-07-13T21:39:28.4226188Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> 2020-07-13T21:39:28.4226589Z [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4226997Z [info]   at org.scalatest.FunSu
[jira] [Created] (SPARK-32296) Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode
Hyukjin Kwon created SPARK-32296: Summary: Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode Key: SPARK-32296 URL: https://issues.apache.org/jira/browse/SPARK-32296 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.1.0 Reporter: Hyukjin Kwon
{code}
2020-07-13T21:39:28.3795362Z [info] - submit a barrier ResultStage that requires more slots than current total under local-cluster mode *** FAILED *** (5 seconds, 703 milliseconds)
2020-07-13T21:39:28.3843780Z [info]   Expected exception org.apache.spark.SparkException to be thrown, but java.util.concurrent.TimeoutException was thrown (BarrierStageOnSubmittedSuite.scala:53)
2020-07-13T21:39:28.3844344Z [info]   org.scalatest.exceptions.TestFailedException:
2020-07-13T21:39:28.4058689Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
2020-07-13T21:39:28.4059209Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
2020-07-13T21:39:28.4175876Z [info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
2020-07-13T21:39:28.4176563Z [info]   at org.scalatest.Assertions.intercept(Assertions.scala:814)
2020-07-13T21:39:28.4176967Z [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:804)
2020-07-13T21:39:28.4177353Z [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
2020-07-13T21:39:28.4177794Z [info]   at org.apache.spark.BarrierStageOnSubmittedSuite.testSubmitJob(BarrierStageOnSubmittedSuite.scala:53)
2020-07-13T21:39:28.4178272Z [info]   at org.apache.spark.BarrierStageOnSubmittedSuite.$anonfun$new$35(BarrierStageOnSubmittedSuite.scala:240)
2020-07-13T21:39:28.4178695Z [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2020-07-13T21:39:28.4179081Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2020-07-13T21:39:28.4179731Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2020-07-13T21:39:28.4180162Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2020-07-13T21:39:28.4180550Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
2020-07-13T21:39:28.4180929Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
2020-07-13T21:39:28.4181323Z [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
2020-07-13T21:39:28.4181728Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
2020-07-13T21:39:28.4223205Z [info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
2020-07-13T21:39:28.4223689Z [info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
2020-07-13T21:39:28.4224119Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
2020-07-13T21:39:28.4224510Z [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
2020-07-13T21:39:28.4224901Z [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
2020-07-13T21:39:28.4225362Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
2020-07-13T21:39:28.4225778Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
2020-07-13T21:39:28.4226188Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
2020-07-13T21:39:28.4226589Z [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
2020-07-13T21:39:28.4226997Z [info]   at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
2020-07-13T21:39:28.4227685Z [info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
2020-07-13T21:39:28.4228069Z [info]   at scala.collection.immutable.List.foreach(List.scala:392)
2020-07-13T21:39:28.4228461Z [info]   at org.scalatest.SuperEngine.trav
[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32292: Assignee: Hyukjin Kwon (was: Apache Spark) > Run only relevant builds in parallel at Github Actions > -- > > Key: SPARK-32292 > URL: https://issues.apache.org/jira/browse/SPARK-32292 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Jenkins already runs only relevant tests. Github Actions should also reuse > and follow it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32004) Drop references to slave
[ https://issues.apache.org/jira/browse/SPARK-32004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-32004. -- Fix Version/s: 3.1.0 Assignee: Holden Karau Resolution: Fixed > Drop references to slave > > > Key: SPARK-32004 > URL: https://issues.apache.org/jira/browse/SPARK-32004 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > > We have a lot of references to "slave" in the code base which doesn't match > the terminology in the rest of our code base and we should clean it up. In > many situations it would be clearer with "executor", "worker", or "replica" > depending on the context (so this is not just a search and replace but > actually read through the code and make it consistent). > > We may want to (in a follow on) explore renaming master to something more > precise. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156976#comment-17156976 ] Apache Spark commented on SPARK-32295: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29092 > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156975#comment-17156975 ] Apache Spark commented on SPARK-32295: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/29092 > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32295: Assignee: Apache Spark > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32295: Assignee: (was: Apache Spark) > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
Tanel Kiis created SPARK-32295: -- Summary: Add not null and size > 0 filters before inner explode to benefit from predicate pushdown Key: SPARK-32295 URL: https://issues.apache.org/jira/browse/SPARK-32295 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 3.1.0 Reporter: Tanel Kiis -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
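The rewrite proposed in SPARK-32295 can be illustrated with a plain-Python analogy (not PySpark or Catalyst): an *inner* explode already drops rows whose array column is NULL or empty, so an equivalent "IsNotNull AND size > 0" filter can be injected before it without changing results, making the predicate eligible for pushdown. The function and column names below are illustrative, not Spark's.

```python
def inner_explode(rows, col):
    """Yield one output row per array element; rows whose array is NULL
    (None) or empty produce nothing -- inner-explode semantics."""
    for row in rows:
        values = row.get(col)
        if values:  # skips None and []
            for v in values:
                yield {**row, col: v}

def inject_prefilter(rows, col):
    """The filter the optimizer could add before the explode: it removes
    exactly the rows the inner explode would drop anyway, so the result
    is unchanged, but the predicate can now be pushed to the source."""
    return [r for r in rows if r.get(col) is not None and len(r[col]) > 0]

data = [
    {"id": 1, "xs": [10, 20]},
    {"id": 2, "xs": None},
    {"id": 3, "xs": []},
]

plain = list(inner_explode(data, "xs"))
filtered = list(inner_explode(inject_prefilter(data, "xs"), "xs"))
assert plain == filtered  # the injected filter preserves semantics
```

The equivalence only holds for the inner (default) explode; `explode_outer` keeps NULL/empty rows, so the filter must not be injected there.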
[jira] [Updated] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-32234: Target Version/s: 3.0.1 > Spark sql commands are failing on select Queries for the orc tables > > > Key: SPARK-32234 > URL: https://issues.apache.org/jira/browse/SPARK-32234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Saurabh Chawla >Priority: Blocker > > Spark sql commands are failing on select Queries for the orc tables > Steps to reproduce > > {code:java} > val table = """CREATE TABLE `date_dim` ( > `d_date_sk` INT, > `d_date_id` STRING, > `d_date` TIMESTAMP, > `d_month_seq` INT, > `d_week_seq` INT, > `d_quarter_seq` INT, > `d_year` INT, > `d_dow` INT, > `d_moy` INT, > `d_dom` INT, > `d_qoy` INT, > `d_fy_year` INT, > `d_fy_quarter_seq` INT, > `d_fy_week_seq` INT, > `d_day_name` STRING, > `d_quarter_name` STRING, > `d_holiday` STRING, > `d_weekend` STRING, > `d_following_holiday` STRING, > `d_first_dom` INT, > `d_last_dom` INT, > `d_same_day_ly` INT, > `d_same_day_lq` INT, > `d_current_day` STRING, > `d_current_week` STRING, > `d_current_month` STRING, > `d_current_quarter` STRING, > `d_current_year` STRING) > USING orc > LOCATION '/Users/test/tpcds_scale5data/date_dim' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1574682806')""" > spark.sql(table).collect > val u = """select date_dim.d_date_id from date_dim limit 5""" > spark.sql(u).collect > {code} > > > Exception > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 > (TID 2, 192.168.0.103, executor driver): > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258) 
> at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:336) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:133) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > {code} > > > The reason behind this initBatch is not getting the schema that is needed to > find out the column value in OrcFileFormat.scala > > {code:java} > batchReader.initBatch( > TypeDescription.fromString(resultSchemaString){code} > > Query is working if > {code:java} > val u = """select * from date_dim limit 5"""{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-32234: Priority: Blocker (was: Major) > Spark sql commands are failing on select Queries for the orc tables > > > Key: SPARK-32234 > URL: https://issues.apache.org/jira/browse/SPARK-32234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Saurabh Chawla >Priority: Blocker > > Spark sql commands are failing on select Queries for the orc tables > Steps to reproduce > > {code:java} > val table = """CREATE TABLE `date_dim` ( > `d_date_sk` INT, > `d_date_id` STRING, > `d_date` TIMESTAMP, > `d_month_seq` INT, > `d_week_seq` INT, > `d_quarter_seq` INT, > `d_year` INT, > `d_dow` INT, > `d_moy` INT, > `d_dom` INT, > `d_qoy` INT, > `d_fy_year` INT, > `d_fy_quarter_seq` INT, > `d_fy_week_seq` INT, > `d_day_name` STRING, > `d_quarter_name` STRING, > `d_holiday` STRING, > `d_weekend` STRING, > `d_following_holiday` STRING, > `d_first_dom` INT, > `d_last_dom` INT, > `d_same_day_ly` INT, > `d_same_day_lq` INT, > `d_current_day` STRING, > `d_current_week` STRING, > `d_current_month` STRING, > `d_current_quarter` STRING, > `d_current_year` STRING) > USING orc > LOCATION '/Users/test/tpcds_scale5data/date_dim' > TBLPROPERTIES ( > 'transient_lastDdlTime' = '1574682806')""" > spark.sql(table).collect > val u = """select date_dim.d_date_id from date_dim limit 5""" > spark.sql(u).collect > {code} > > > Exception > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 > (TID 2, 192.168.0.103, executor driver): > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156) > at > 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:336) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:133) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > > > The reason behind this initBatch is not getting the schema that is needed to > find out the column value in OrcFileFormat.scala > > {code:java} > batchReader.initBatch( > TypeDescription.fromString(resultSchemaString){code} > > Query is working if > {code:java} > val u = """select * from date_dim limit 5"""{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
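The failure mode reported in SPARK-32234 can be sketched in a stdlib-only analogy (not the actual ORC reader code): a column position computed against one schema is used to index a batch allocated from a different, narrower schema, producing exactly the `ArrayIndexOutOfBoundsException: 1` seen in the stack trace. The schemas below are abbreviated and illustrative.

```python
full_schema = ["d_date_sk", "d_date_id", "d_date"]  # abbreviated table schema
result_schema = ["d_date_id"]                       # schema of "select d_date_id"

# Position of d_date_id computed against the full table schema...
pos = full_schema.index("d_date_id")  # 1

# ...used to index a batch that was sized from the result schema:
batch = [None] * len(result_schema)   # only one column allocated

raised = False
try:
    batch[pos]
except IndexError:
    raised = True  # mirrors ArrayIndexOutOfBoundsException: 1

assert pos == 1 and raised
```

This also matches the report's observation that `select *` works: with the full schema the positions and the allocated batch width agree.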
[jira] [Commented] (SPARK-32258) NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child expressions
[ https://issues.apache.org/jira/browse/SPARK-32258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156939#comment-17156939 ] Apache Spark commented on SPARK-32258: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29091 > NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child > expressions > --- > > Key: SPARK-32258 > URL: https://issues.apache.org/jira/browse/SPARK-32258 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > Currently NormalizeFloatingNumbers rule treats some expressions as black box > but we can optimize it a bit by normalizing directly the inner children > expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32258) NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child expressions
[ https://issues.apache.org/jira/browse/SPARK-32258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156938#comment-17156938 ] Apache Spark commented on SPARK-32258: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29091 > NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child > expressions > --- > > Key: SPARK-32258 > URL: https://issues.apache.org/jira/browse/SPARK-32258 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0 > > > Currently NormalizeFloatingNumbers rule treats some expressions as black box > but we can optimize it a bit by normalizing directly the inner children > expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
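The soundness of the SPARK-32258 rewrite rests on normalization commuting with conditionals: normalizing the result of an `If` equals applying the `If` to normalized branches. A hedged stdlib sketch of that property (the `normalize` here only canonicalizes -0.0 and NaN, a simplification of what NormalizeFloatingNumbers does):

```python
import math

def normalize(x):
    """Canonicalize floats so -0.0 and all NaN payloads compare/hash alike."""
    if math.isnan(x):
        return float("nan")
    if x == 0.0:  # catches both 0.0 and -0.0
        return 0.0
    return x

def if_expr(cond, a, b):
    return a if cond else b

# normalize(If(c, a, b)) == If(c, normalize(a), normalize(b)) for all inputs,
# so the rule may push normalization into the branches instead of wrapping
# the whole conditional in a black-box normalization node.
for cond in (True, False):
    for a in (-0.0, 1.5):
        for b in (0.0, -0.0):
            lhs = normalize(if_expr(cond, a, b))
            rhs = if_expr(cond, normalize(a), normalize(b))
            assert lhs == rhs and math.copysign(1.0, lhs) == math.copysign(1.0, rhs)
```

The same argument extends to CaseWhen and Coalesce, since each just selects one of its children.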
[jira] [Assigned] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32293: Assignee: (was: Apache Spark) > Inconsistent default unit between Spark memory configs and JVM option > - > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > Both for executors and for the driver the memory can be configured > separately. All of these are following the format of JVM memory > configurations in a way they are using the very same size unit suffixes ("k", > "m", "g" or "t") but there is an inconsistency regarding the default unit. > When no suffix is given then the given amount is passed as it is to the JVM > (to the -Xmx and -Xms options) where this memory options are using bytes as a > default unit, for this please see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > Although the Spark memory config default suffix unit is "m". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32293: Assignee: Apache Spark > Inconsistent default unit between Spark memory configs and JVM option > - > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > Both for executors and for the driver the memory can be configured > separately. All of these are following the format of JVM memory > configurations in a way they are using the very same size unit suffixes ("k", > "m", "g" or "t") but there is an inconsistency regarding the default unit. > When no suffix is given then the given amount is passed as it is to the JVM > (to the -Xmx and -Xms options) where this memory options are using bytes as a > default unit, for this please see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > Although the Spark memory config default suffix unit is "m". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156917#comment-17156917 ] Apache Spark commented on SPARK-32293: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/29090 > Inconsistent default unit between Spark memory configs and JVM option > - > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > Both for executors and for the driver the memory can be configured > separately. All of these are following the format of JVM memory > configurations in a way they are using the very same size unit suffixes ("k", > "m", "g" or "t") but there is an inconsistency regarding the default unit. > When no suffix is given then the given amount is passed as it is to the JVM > (to the -Xmx and -Xms options) where this memory options are using bytes as a > default unit, for this please see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > Although the Spark memory config default suffix unit is "m". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32294) GroupedData Pandas UDF 2Gb limit
[ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-32294: -- Description: `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData, the whole group is passed to Pandas UDF at once, which can cause various 2Gb limitations on Arrow side (and in current versions of Arrow, also 2Gb limitation on Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890 Would be great to consider feeding GroupedData into a pandas UDF in batches to solve this issue. cc [~hyukjin.kwon] was: `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData, the whole group is passed to Pandas UDF as once, which can cause various 2Gb limitations on Arrow side (and in current versions of Arrow, also 2Gb limitation on Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890 Would be great to consider feeding GroupedData into a pandas UDF in batches to solve this issue. cc [~hyukjin.kwon] > GroupedData Pandas UDF 2Gb limit > > > Key: SPARK-32294 > URL: https://issues.apache.org/jira/browse/SPARK-32294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for > GroupedData, the whole group is passed to Pandas UDF at once, which can cause > various 2Gb limitations on Arrow side (and in current versions of Arrow, also > 2Gb limitation on Netty allocator side) - > https://issues.apache.org/jira/browse/ARROW-4890 > Would be great to consider feeding GroupedData into a pandas UDF in batches > to solve this issue. > cc [~hyukjin.kwon] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32294) GroupedData Pandas UDF 2Gb limit
Ruslan Dautkhanov created SPARK-32294: - Summary: GroupedData Pandas UDF 2Gb limit Key: SPARK-32294 URL: https://issues.apache.org/jira/browse/SPARK-32294 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0, 3.1.0 Reporter: Ruslan Dautkhanov `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData, the whole group is passed to Pandas UDF as once, which can cause various 2Gb limitations on Arrow side (and in current versions of Arrow, also 2Gb limitation on Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890 Would be great to consider feeding GroupedData into a pandas UDF in batches to solve this issue. cc [~hyukjin.kwon] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
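The batching suggested in SPARK-32294 can be sketched with a stdlib-only analogy (no PySpark or Arrow): rather than handing a whole group to the UDF at once, feed it in bounded slices so no single transfer carries the full group. The function name and parameter below are hypothetical; `max_records_per_batch` plays the role of `spark.sql.execution.arrow.maxRecordsPerBatch` for grouped data.

```python
def apply_in_batches(group, udf, max_records_per_batch):
    """Apply `udf` to fixed-size slices of `group` and concatenate the
    results, so each call stays under the batch limit."""
    out = []
    for start in range(0, len(group), max_records_per_batch):
        out.extend(udf(group[start:start + max_records_per_batch]))
    return out

double = lambda batch: [x * 2 for x in batch]
group = list(range(10))

# For a row-wise UDF, batched application equals whole-group application.
assert apply_in_batches(group, double, 3) == double(group)
```

The caveat is visible in the sketch: the equivalence only holds for row-wise UDFs. A UDF that aggregates over the whole group (e.g. subtracting the group mean) would change meaning under batching, which is presumably why grouped data is not batched today.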
[jira] [Commented] (SPARK-30282) Migrate SHOW TBLPROPERTIES to new framework
[ https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156900#comment-17156900 ] Apache Spark commented on SPARK-30282: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/28375 > Migrate SHOW TBLPROPERTIES to new framework > --- > > Key: SPARK-30282 > URL: https://issues.apache.org/jira/browse/SPARK-30282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > For the following v2 commands, _Analyzer.ResolveTables_ does not check > against the temp views before resolving _UnresolvedV2Relation_, thus it > always resolves _UnresolvedV2Relation_ to a table: > * ALTER TABLE > * DESCRIBE TABLE > * SHOW TBLPROPERTIES > Thus, in the following example, 't' will be resolved to a table, not a temp > view: > {code:java} > sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i") > sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i") > sql("USE testcat.ns") > sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table > {code} > For V2 commands, if a table is resolved to a temp view, it should error out > with a message that v2 command cannot handle temp views. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-32293: --- Summary: Inconsistent default unit between Spark memory configs and JVM option (was: Inconsistent default units for configuring Spark memory) > Inconsistent default unit between Spark memory configs and JVM option > - > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > Both for executors and for the driver the memory can be configured > separately. All of these are following the format of JVM memory > configurations in a way they are using the very same size unit suffixes ("k", > "m", "g" or "t") but there is an inconsistency regarding the default unit. > When no suffix is given then the given amount is passed as it is to the JVM > (to the -Xmx and -Xms options) where this memory options are using bytes as a > default unit, for this please see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > Although the Spark memory config default suffix unit is "m". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32293) Inconsistent default units for configuring Spark memory
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156884#comment-17156884 ] Attila Zsolt Piros commented on SPARK-32293: I am working on this. > Inconsistent default units for configuring Spark memory > --- > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > Both for executors and for the driver the memory can be configured > separately. All of these are following the format of JVM memory > configurations in a way they are using the very same size unit suffixes ("k", > "m", "g" or "t") but there is an inconsistency regarding the default unit. > When no suffix is given then the given amount is passed as it is to the JVM > (to the -Xmx and -Xms options) where this memory options are using bytes as a > default unit, for this please see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > Although the Spark memory config default suffix unit is "m". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32293) Inconsistent default units for configuring Spark memory
Attila Zsolt Piros created SPARK-32293: -- Summary: Inconsistent default units for configuring Spark memory Key: SPARK-32293 URL: https://issues.apache.org/jira/browse/SPARK-32293 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 3.0.0, 2.4.6, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.3, 2.2.2, 2.2.1, 3.0.1, 3.1.0 Reporter: Attila Zsolt Piros Spark's maximum memory can be configured in several ways: - via Spark config - command line argument - environment variables Both for executors and for the driver the memory can be configured separately. All of these are following the format of JVM memory configurations in a way they are using the very same size unit suffixes ("k", "m", "g" or "t") but there is an inconsistency regarding the default unit. When no suffix is given then the given amount is passed as it is to the JVM (to the -Xmx and -Xms options) where this memory options are using bytes as a default unit, for this please see the example [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: {noformat} The following examples show how to set the maximum allowed size of allocated memory to 80 MB using various units: -Xmx83886080 -Xmx81920k -Xmx80m {noformat} Although the Spark memory config default suffix unit is "m". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
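The inconsistency described in SPARK-32293 is easy to state as code. The two hypothetical parsers below contrast the interpretations: a bare number in a Spark memory config defaults to mebibytes, while the same bare number passed straight to `-Xmx` is taken by the JVM as bytes.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def spark_memory_bytes(value: str) -> int:
    """Spark-style parsing: default unit is 'm' when no suffix is given."""
    suffix = value[-1].lower()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value) * UNITS["m"]  # bare number -> mebibytes

def jvm_xmx_bytes(value: str) -> int:
    """JVM -Xmx parsing: default unit is bytes when no suffix is given."""
    suffix = value[-1].lower()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)               # bare number -> bytes

# "80m", "81920k" and "83886080" all mean 80 MiB to the JVM...
assert jvm_xmx_bytes("80m") == jvm_xmx_bytes("81920k") == jvm_xmx_bytes("83886080")
# ...but a bare "83886080" in a Spark memory config would not mean 80 MiB:
assert spark_memory_bytes("83886080") != jvm_xmx_bytes("83886080")
```

With a suffix present the two agree; the divergence is only in the suffixless default, which is the bug as filed.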
[jira] [Commented] (SPARK-32279) Install Sphinx in Python 3 on Jenkins machines
[ https://issues.apache.org/jira/browse/SPARK-32279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156846#comment-17156846 ] Shane Knapp commented on SPARK-32279: - any particular version of sphinx you want installed? > Install Sphinx in Python 3 on Jenkins machines > -- > > Key: SPARK-32279 > URL: https://issues.apache.org/jira/browse/SPARK-32279 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Shane Knapp >Priority: Major > > Currently Sphinx is only installed in Python 2. We should install it in > Python 3 and test it in Jenkins as Python 2, 3.4 and 3.5 were dropped at > SPARK-32138. > See also: > https://github.com/apache/spark/pull/28957/files#diff-ccd847a0316575dde31bd89786bbe1f2R176 > https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/dev/lint-python#L176 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32276) Remove redundant sorts before repartition nodes
[ https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156844#comment-17156844 ] Apache Spark commented on SPARK-32276: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/29089 > Remove redundant sorts before repartition nodes > --- > > Key: SPARK-32276 > URL: https://issues.apache.org/jira/browse/SPARK-32276 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > I think our {{EliminateSorts}} rule can be extended further to remove sorts > before repartition, repartitionByExpression and coalesce nodes. Independently > of whether we do a shuffle or not, each repartition operation will change the > ordering and distribution of data. > That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as > {{Repartition -> Scan}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
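The proposed {{Repartition -> Sort -> Scan}} to {{Repartition -> Scan}} rewrite can be sketched as a recursive tree rule. The toy plan nodes below are illustrative only, not Spark's Catalyst API:

```python
# Minimal sketch of the proposed EliminateSorts extension: a Sort directly
# under a repartition is redundant, because repartitioning destroys any
# ordering of its input anyway.
from dataclasses import dataclass
from typing import Any

@dataclass
class Scan:          # leaf node
    table: str

@dataclass
class Sort:
    child: Any

@dataclass
class Repartition:   # stands in for repartition/repartitionByExpression/coalesce
    child: Any

def eliminate_sorts(plan):
    """Rewrite Repartition(Sort(x)) as Repartition(x), bottom-up."""
    if isinstance(plan, Repartition):
        child = eliminate_sorts(plan.child)
        if isinstance(child, Sort):
            child = child.child  # drop the redundant sort
        return Repartition(child)
    if isinstance(plan, Sort):
        return Sort(eliminate_sorts(plan.child))
    return plan

# Repartition -> Sort -> Scan becomes Repartition -> Scan.
assert eliminate_sorts(Repartition(Sort(Scan("t")))) == Repartition(Scan("t"))
# A Sort not under a repartition is preserved.
assert eliminate_sorts(Sort(Scan("t"))) == Sort(Scan("t"))
```

Note the rule only removes a sort whose output feeds directly into a repartition; sorts elsewhere in the plan still matter for correctness.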
[jira] [Comment Edited] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy
[ https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156842#comment-17156842 ] Shane Knapp edited comment on SPARK-32278 at 7/13/20, 5:03 PM: --- which version of pypy3 are we interested in? we currently have pypy 7.2 (python 3.6.9) installed on the centos workers, and i'd like to nail down a version before i install this on the ubuntu nodes. {{[sknapp@amp-jenkins-worker-05 ~]$ pypy3}} {{Python 3.6.9 (5da45ced70e515f94686be0df47c59abd1348ebc, Oct 17 2019, 22:59:56)}} {{[PyPy 7.2.0 with GCC 8.2.0] on linux}} {{Type "help", "copyright", "credits" or "license" for more information.}} {{}} was (Author: shaneknapp): which version of pypy3 are we interested in? we currently have 3.6.9 installed on the centos workers, and i'd like to nail down a version before i install this on the ubuntu nodes. {{[sknapp@amp-jenkins-worker-05 ~]$ pypy3}} {{Python 3.6.9 (5da45ced70e515f94686be0df47c59abd1348ebc, Oct 17 2019, 22:59:56)}} {{[PyPy 7.2.0 with GCC 8.2.0] on linux}} {{Type "help", "copyright", "credits" or "license" for more information.}} {{}} > Install PyPy3 on Jenkins to enable PySpark tests with PyPy > -- > > Key: SPARK-32278 > URL: https://issues.apache.org/jira/browse/SPARK-32278 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Shane Knapp >Priority: Major > > Current PyPy installed in Jenkins is too old, which is Python 2 compatible. > Python 2 will be dropped at SPARK-32138, and we should now upgrade PyPy to > Python 3 compatible PyPy 3. 
> See also: > https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160 > https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32276) Remove redundant sorts before repartition nodes
[ https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32276: Assignee: Apache Spark > Remove redundant sorts before repartition nodes > --- > > Key: SPARK-32276 > URL: https://issues.apache.org/jira/browse/SPARK-32276 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark >Priority: Major > > I think our {{EliminateSorts}} rule can be extended further to remove sorts > before repartition, repartitionByExpression and coalesce nodes. Independently > of whether we do a shuffle or not, each repartition operation will change the > ordering and distribution of data. > That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as > {{Repartition -> Scan}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy
[ https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156842#comment-17156842 ] Shane Knapp commented on SPARK-32278: - which version of pypy3 are we interested in? we currently have 3.6.9 installed on the centos workers, and i'd like to nail down a version before i install this on the ubuntu nodes. {{[sknapp@amp-jenkins-worker-05 ~]$ pypy3}} {{Python 3.6.9 (5da45ced70e515f94686be0df47c59abd1348ebc, Oct 17 2019, 22:59:56)}} {{[PyPy 7.2.0 with GCC 8.2.0] on linux}} {{Type "help", "copyright", "credits" or "license" for more information.}} {{}} > Install PyPy3 on Jenkins to enable PySpark tests with PyPy > -- > > Key: SPARK-32278 > URL: https://issues.apache.org/jira/browse/SPARK-32278 > Project: Spark > Issue Type: Test > Components: Project Infra, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Shane Knapp >Priority: Major > > Current PyPy installed in Jenkins is too old, which is Python 2 compatible. > Python 2 will be dropped at SPARK-32138, and we should now upgrade PyPy to > Python 3 compatible PyPy 3. > See also: > https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160 > https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32276) Remove redundant sorts before repartition nodes
[ https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32276: Assignee: (was: Apache Spark) > Remove redundant sorts before repartition nodes > --- > > Key: SPARK-32276 > URL: https://issues.apache.org/jira/browse/SPARK-32276 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Anton Okolnychyi >Priority: Major > > I think our {{EliminateSorts}} rule can be extended further to remove sorts > before repartition, repartitionByExpression and coalesce nodes. Independently > of whether we do a shuffle or not, each repartition operation will change the > ordering and distribution of data. > That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as > {{Repartition -> Scan}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32252) Enable doctests in run-tests.py back
[ https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32252: - Assignee: Hyukjin Kwon > Enable doctests in run-tests.py back > > > Key: SPARK-32252 > URL: https://issues.apache.org/jira/browse/SPARK-32252 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > In run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} > is set. This is mainly because the doctests fail in Github Actions. > We should test it. Currently it fails as below: > {code} > fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > ** > File "./dev/run-tests.py", line 75, in > __main__.identify_changed_files_from_git_commits > Failed example: > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.6/doctest.py", line 1330, in __run > compileflags, 1), test.globs) > File "", > line 1, in > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > File "./dev/run-tests.py", line 87, in > identify_changed_files_from_git_commits > universal_newlines=True) > File "/usr/lib/python3.6/subprocess.py", line 356, in check_output > **kwargs).stdout > File "/usr/lib/python3.6/subprocess.py", line 438, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command '['git', 'diff-tree', > '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit > status 128. > fatal: ambiguous argument '50a0496a43': unknown revision or path not in the > working tree. 
> Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > {code} > Looks like we should fetch the commit to test in GitHub Actions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
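The {{fatal: ambiguous argument ... unknown revision}} errors above are what git reports when a commit is not present locally, which is what happens in a shallow checkout. One way to make past commits resolvable for {{git diff-tree}} is to check out the full history; the workflow fragment below is a hedged sketch of that idea using actions/checkout's {{fetch-depth}} parameter, not necessarily the exact fix the linked PR took:

```yaml
# Hypothetical GitHub Actions step: fetch-depth: 0 disables the default
# shallow clone so commits like fc0a1475ef can be resolved by git diff-tree.
- uses: actions/checkout@v2
  with:
    fetch-depth: 0   # fetch all history, not just the tip commit
```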
[jira] [Resolved] (SPARK-32252) Enable doctests in run-tests.py back
[ https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32252. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29086 [https://github.com/apache/spark/pull/29086] > Enable doctests in run-tests.py back > > > Key: SPARK-32252 > URL: https://issues.apache.org/jira/browse/SPARK-32252 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > In run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} > is set. This is mainly because the doctests fail in Github Actions. > We should test it. Currently it fails as below: > {code} > fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > ** > File "./dev/run-tests.py", line 75, in > __main__.identify_changed_files_from_git_commits > Failed example: > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.6/doctest.py", line 1330, in __run > compileflags, 1), test.globs) > File "", > line 1, in > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > File "./dev/run-tests.py", line 87, in > identify_changed_files_from_git_commits > universal_newlines=True) > File "/usr/lib/python3.6/subprocess.py", line 356, in check_output > **kwargs).stdout > File "/usr/lib/python3.6/subprocess.py", line 438, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command '['git', 'diff-tree', > '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit > status 128. 
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > {code} > Looks like we should fetch the commit to test in GitHub Actions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32292) Run only relevant builds in parallel at Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32292. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29086 [https://github.com/apache/spark/pull/29086] > Run only relevant builds in parallel at Github Actions > -- > > Key: SPARK-32292 > URL: https://issues.apache.org/jira/browse/SPARK-32292 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > Jenkins already runs only relevant tests. Github Actions should also reuse > and follow it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel
[ https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32289: Assignee: (was: Apache Spark) > Chinese characters are garbled when opening csv files with Excel > > > Key: SPARK-32289 > URL: https://issues.apache.org/jira/browse/SPARK-32289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: garbled.png > > > How to reproduce this issue: > {code:scala} > spark.sql("SELECT '我爱中文' AS chinese").write.option("header", > "true").csv("/tmp/spark/csv") > {code} > !garbled.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
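The garbling happens because the CSV is written as plain UTF-8, while Excel assumes a legacy locale encoding unless the file begins with a UTF-8 byte order mark (BOM). The sketch below shows the difference in plain Python; it is an illustration of the encoding issue, not Spark's CSV writer:

```python
# Excel auto-detects UTF-8 only when the file starts with the BOM bytes
# EF BB BF; Spark's CSV output does not include them by default.
import codecs

row = "我爱中文\n"

plain = row.encode("utf-8")          # what a plain UTF-8 writer produces
with_bom = codecs.BOM_UTF8 + plain   # what Excel reliably detects as UTF-8

assert not plain.startswith(codecs.BOM_UTF8)
assert with_bom.startswith(b"\xef\xbb\xbf")
assert with_bom[3:] == plain         # the payload itself is unchanged
```

Prepending the BOM changes only the first three bytes of the file; other UTF-8-aware tools ignore it, which is why it is a common workaround for Excel interoperability.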
[jira] [Commented] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel
[ https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156780#comment-17156780 ] Apache Spark commented on SPARK-32289: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/29088 > Chinese characters are garbled when opening csv files with Excel > > > Key: SPARK-32289 > URL: https://issues.apache.org/jira/browse/SPARK-32289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: garbled.png > > > How to reproduce this issue: > {code:scala} > spark.sql("SELECT '我爱中文' AS chinese").write.option("header", > "true").csv("/tmp/spark/csv") > {code} > !garbled.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel
[ https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32289: Assignee: Apache Spark > Chinese characters are garbled when opening csv files with Excel > > > Key: SPARK-32289 > URL: https://issues.apache.org/jira/browse/SPARK-32289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Attachments: garbled.png > > > How to reproduce this issue: > {code:scala} > spark.sql("SELECT '我爱中文' AS chinese").write.option("header", > "true").csv("/tmp/spark/csv") > {code} > !garbled.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28227) Spark can’t support TRANSFORM with aggregation
[ https://issues.apache.org/jira/browse/SPARK-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156698#comment-17156698 ] Apache Spark commented on SPARK-28227: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29087 > Spark can’t support TRANSFORM with aggregation > --- > > Key: SPARK-28227 > URL: https://issues.apache.org/jira/browse/SPARK-28227 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Spark can't support using TRANSFORM with aggregation, such as: > {code:java} > SELECT TRANSFORM(T.A, SUM(T.B)) > USING 'func' AS (X STRING, Y STRING) > FROM DEFAULT.TEST T > WHERE T.C > 0 > GROUP BY T.A{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
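Conceptually, TRANSFORM serializes each input row as a tab-separated line, pipes it through the external command named in USING, and parses the command's stdout back into rows. The toy emulation below illustrates that data flow with an identity child process (standing in for USING 'cat'); it is a sketch of the semantics, not Spark's script-transform implementation:

```python
# Emulate TRANSFORM(T.A, SUM(T.B)) USING 'cat' AS (X STRING, Y STRING):
# rows out as TSV -> external process -> TSV back in as string columns.
import subprocess
import sys

rows = [("a", 3), ("b", 7)]  # e.g. (T.A, SUM(T.B)) per group

# Child process that echoes stdin to stdout, like USING 'cat'.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read())"]

stdin = "".join(f"{a}\t{b}\n" for a, b in rows)
out = subprocess.run(child, input=stdin, capture_output=True,
                     text=True).stdout

# Parse stdout back into (X STRING, Y STRING) rows -- everything comes
# back as strings, matching the declared STRING output schema.
result = [tuple(line.split("\t")) for line in out.splitlines()]
assert result == [("a", "3"), ("b", "7")]
```

The issue is that Spark's planner rejects aggregate expressions such as SUM(T.B) inside the TRANSFORM argument list, even though, as the sketch shows, the script interface itself only ever sees serialized row text.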
[jira] [Commented] (SPARK-32252) Enable doctests in run-tests.py back
[ https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156641#comment-17156641 ] Apache Spark commented on SPARK-32252: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29086 > Enable doctests in run-tests.py back > > > Key: SPARK-32252 > URL: https://issues.apache.org/jira/browse/SPARK-32252 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > In run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} > is set. This is mainly because the doctests fail in Github Actions. > We should test it. Currently it fails as below: > {code} > fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > ** > File "./dev/run-tests.py", line 75, in > __main__.identify_changed_files_from_git_commits > Failed example: > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.6/doctest.py", line 1330, in __run > compileflags, 1), test.globs) > File "", > line 1, in > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > File "./dev/run-tests.py", line 87, in > identify_changed_files_from_git_commits > universal_newlines=True) > File "/usr/lib/python3.6/subprocess.py", line 356, in check_output > **kwargs).stdout > File "/usr/lib/python3.6/subprocess.py", line 438, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command '['git', 'diff-tree', > '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit > status 128. 
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > {code} > Looks like we should fetch the commit to test in GitHub Actions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32292: Assignee: Apache Spark > Run only relevant builds in parallel at Github Actions > -- > > Key: SPARK-32292 > URL: https://issues.apache.org/jira/browse/SPARK-32292 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Jenkins already runs only relevant tests. Github Actions should also reuse > and follow it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32252) Enable doctests in run-tests.py back
[ https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32252: Assignee: Apache Spark > Enable doctests in run-tests.py back > > > Key: SPARK-32252 > URL: https://issues.apache.org/jira/browse/SPARK-32252 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > In run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} > is set. This is mainly because the doctests fail in Github Actions. > We should test it. Currently it fails as below: > {code} > fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > ** > File "./dev/run-tests.py", line 75, in > __main__.identify_changed_files_from_git_commits > Failed example: > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.6/doctest.py", line 1330, in __run > compileflags, 1), test.globs) > File "", > line 1, in > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > File "./dev/run-tests.py", line 87, in > identify_changed_files_from_git_commits > universal_newlines=True) > File "/usr/lib/python3.6/subprocess.py", line 356, in check_output > **kwargs).stdout > File "/usr/lib/python3.6/subprocess.py", line 438, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command '['git', 'diff-tree', > '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit > status 128. > fatal: ambiguous argument '50a0496a43': unknown revision or path not in the > working tree. 
> Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > {code} > Looks like we should fetch the commit to test in GitHub Actions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32252) Enable doctests in run-tests.py back
[ https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32252: Assignee: (was: Apache Spark) > Enable doctests in run-tests.py back > > > Key: SPARK-32252 > URL: https://issues.apache.org/jira/browse/SPARK-32252 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > In run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} > is set. This is mainly because the doctests fail in Github Actions. > We should test it. Currently it fails as below: > {code} > fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] -- [...]' > ** > File "./dev/run-tests.py", line 75, in > __main__.identify_changed_files_from_git_commits > Failed example: > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.6/doctest.py", line 1330, in __run > compileflags, 1), test.globs) > File "", > line 1, in > [x.name for x in determine_modules_for_files( > identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))] > File "./dev/run-tests.py", line 87, in > identify_changed_files_from_git_commits > universal_newlines=True) > File "/usr/lib/python3.6/subprocess.py", line 356, in check_output > **kwargs).stdout > File "/usr/lib/python3.6/subprocess.py", line 438, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command '['git', 'diff-tree', > '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit > status 128. > fatal: ambiguous argument '50a0496a43': unknown revision or path not in the > working tree. > Use '--' to separate paths from revisions, like this: > 'git [...] 
-- [...]' > {code} > Looks like we should fetch the commit to test in GitHub Actions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32292) Run only relevant builds in parallel at Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156636#comment-17156636 ] Apache Spark commented on SPARK-32292: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29086 > Run only relevant builds in parallel at Github Actions > -- > > Key: SPARK-32292 > URL: https://issues.apache.org/jira/browse/SPARK-32292 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Jenkins already runs only relevant tests. Github Actions should also reuse > and follow it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32292: Assignee: (was: Apache Spark) > Run only relevant builds in parallel at Github Actions > -- > > Key: SPARK-32292 > URL: https://issues.apache.org/jira/browse/SPARK-32292 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Jenkins already runs only relevant tests. Github Actions should also reuse > and follow it.
[jira] [Assigned] (SPARK-32106) Implement script transform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-32106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32106: Assignee: Apache Spark > Implement script transform in sql/core > -- > > Key: SPARK-32106 > URL: https://issues.apache.org/jira/browse/SPARK-32106 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-32106) Implement script transform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-32106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156629#comment-17156629 ] Apache Spark commented on SPARK-32106: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29085 > Implement script transform in sql/core > -- > > Key: SPARK-32106 > URL: https://issues.apache.org/jira/browse/SPARK-32106 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major >
[jira] [Assigned] (SPARK-32106) Implement script transform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-32106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32106: Assignee: (was: Apache Spark) > Implement script transform in sql/core > -- > > Key: SPARK-32106 > URL: https://issues.apache.org/jira/browse/SPARK-32106 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major >
[jira] [Commented] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
[ https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156627#comment-17156627 ] Rob Vesse commented on SPARK-32259: --- bq. We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod logs for the stack trace are not available; we have only the pod events given in the attachment. You should still be able to use {{kubectl logs}} to retrieve the logs of terminated pods, unless these are executor pods that are being evicted, since I believe Spark cleans those up automatically. You can add {{spark.kubernetes.executor.deleteOnTermination=false}} to your configuration to disable this behaviour so that you can go and retrieve those logs later. > tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s > --- > > Key: SPARK-32259 > URL: https://issues.apache.org/jira/browse/SPARK-32259 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prakash Rajendran >Priority: Blocker > Attachments: Capture.PNG > > > In Spark-Submit, I have these config > "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still Spark > is not pointing its spill data to the SPARK_LOCAL_DIRS path. > K8s is evicting the pod due to the error "{color:#de350b}*Pod ephemeral local > storage usage exceeds the total limit of containers.*{color}" > > We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod > logs for the stack trace are not available; we have only the pod events given in the > attachment > >
[jira] [Comment Edited] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
[ https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156626#comment-17156626 ] Rob Vesse edited comment on SPARK-32259 at 7/13/20, 10:32 AM: -- [~prakki79] Ideally you'd also include the following in your report: * The full {{spark-submit}} command * The {{spark-defaults.conf}} or whatever configuration file you are using (if any) * The {{kubectl describe pod}} output for the relevant pod(s) * The {{kubectl get pod -o=yaml}} output for the relevant pod(s) bq. I have these config "spark.kubernetes.local.dirs.tmpfs=true", still Spark is not pointing its spill data to the SPARK_LOCAL_DIRS path. Nothing you have shown so far suggests that this is true; all that configuration setting does is change how Spark configures the relevant {{emptyDir}} volume used for ephemeral storage (and that's assuming you haven't supplied other configuration that explicitly configures local directories). You can exhaust an in-memory volume in exactly the same way as you exhaust a disk-based volume and get your pod evicted. Note that when using in-memory volumes you may need to adjust the amount of memory allocated to your pod per the documentation - http://spark.apache.org/docs/latest/running-on-kubernetes.html#using-ram-for-local-storage was (Author: rvesse): [~prakki79] Ideally you'd also include the following in your report: * The full {{spark-submit}} command * The {{kubectl describe pod}} output for the relevant pod(s) * The {{kubectl get pod -o=yaml}} output for the relevant pod(s) bq. I have these config "spark.kubernetes.local.dirs.tmpfs=true", still Spark is not pointing its spill data to the SPARK_LOCAL_DIRS path.
Nothing you have shown so far suggests that this is true; all that configuration setting does is change how Spark configures the relevant {{emptyDir}} volume used for ephemeral storage (and that's assuming you haven't supplied other configuration that explicitly configures local directories). You can exhaust an in-memory volume in exactly the same way as you exhaust a disk-based volume and get your pod evicted. Note that when using in-memory volumes you may need to adjust the amount of memory allocated to your pod per the documentation - http://spark.apache.org/docs/latest/running-on-kubernetes.html#using-ram-for-local-storage > tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s > --- > > Key: SPARK-32259 > URL: https://issues.apache.org/jira/browse/SPARK-32259 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prakash Rajendran >Priority: Blocker > Attachments: Capture.PNG > > > In Spark-Submit, I have these config > "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still Spark > is not pointing its spill data to the SPARK_LOCAL_DIRS path. > K8s is evicting the pod due to the error "{color:#de350b}*Pod ephemeral local > storage usage exceeds the total limit of containers.*{color}" > > We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod > logs for the stack trace are not available; we have only the pod events given in the > attachment > >
[jira] [Commented] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
[ https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156626#comment-17156626 ] Rob Vesse commented on SPARK-32259: --- [~prakki79] Ideally you'd also include the following in your report: * The full {{spark-submit}} command * The {{kubectl describe pod}} output for the relevant pod(s) * The {{kubectl get pod -o=yaml}} output for the relevant pod(s) bq. I have these config "spark.kubernetes.local.dirs.tmpfs=true", still Spark is not pointing its spill data to the SPARK_LOCAL_DIRS path. Nothing you have shown so far suggests that this is true; all that configuration setting does is change how Spark configures the relevant {{emptyDir}} volume used for ephemeral storage (and that's assuming you haven't supplied other configuration that explicitly configures local directories). You can exhaust an in-memory volume in exactly the same way as you exhaust a disk-based volume and get your pod evicted. Note that when using in-memory volumes you may need to adjust the amount of memory allocated to your pod per the documentation - http://spark.apache.org/docs/latest/running-on-kubernetes.html#using-ram-for-local-storage > tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s > --- > > Key: SPARK-32259 > URL: https://issues.apache.org/jira/browse/SPARK-32259 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prakash Rajendran >Priority: Blocker > Attachments: Capture.PNG > > > In Spark-Submit, I have these config > "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still Spark > is not pointing its spill data to the SPARK_LOCAL_DIRS path. > K8s is evicting the pod due to the error "{color:#de350b}*Pod ephemeral local > storage usage exceeds the total limit of containers.*{color}" > > We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod > logs for the stack trace are not available; we have only the pod events given in the > attachment > >
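Pulling the advice in this thread together, a minimal {{spark-defaults.conf}} sketch. The two Kubernetes property names are exactly the ones cited in the comments above; the overhead value is an illustrative placeholder, not a recommendation, since tmpfs-backed local directories count against the pod's memory:

```properties
# RAM-backed local dirs: Spark configures its ephemeral-storage emptyDir as tmpfs
spark.kubernetes.local.dirs.tmpfs=true
# Keep terminated executor pods around so `kubectl logs` can still retrieve them
spark.kubernetes.executor.deleteOnTermination=false
# tmpfs usage is charged to pod memory; size the overhead accordingly (example value)
spark.executor.memoryOverhead=2g
```

With these in place, an evicted executor's log remains retrievable for post-mortem, and the in-memory volume is accounted for in the pod's memory request rather than its ephemeral-storage limit.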
[jira] [Created] (SPARK-32292) Run only relevant builds in parallel at Github Actions
Hyukjin Kwon created SPARK-32292: Summary: Run only relevant builds in parallel at Github Actions Key: SPARK-32292 URL: https://issues.apache.org/jira/browse/SPARK-32292 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 3.1.0 Reporter: Hyukjin Kwon Jenkins already runs only relevant tests. Github Actions should also reuse and follow it.
[jira] [Commented] (SPARK-32253) Make readability better in the test result logs
[ https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156609#comment-17156609 ] Hyukjin Kwon commented on SPARK-32253: -- See also https://github.com/check-run-reporter/action > Make readability better in the test result logs > --- > > Key: SPARK-32253 > URL: https://issues.apache.org/jira/browse/SPARK-32253 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, the readability of the logs is not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > We should have a way to easily see the failed test cases.
[jira] [Resolved] (SPARK-32105) Refactor current script transform code
[ https://issues.apache.org/jira/browse/SPARK-32105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32105. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27983 [https://github.com/apache/spark/pull/27983] > Refactor current script transform code > -- > > Key: SPARK-32105 > URL: https://issues.apache.org/jira/browse/SPARK-32105 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0 > >
[jira] [Assigned] (SPARK-32105) Refactor current script transform code
[ https://issues.apache.org/jira/browse/SPARK-32105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32105: --- Assignee: angerszhu > Refactor current script transform code > -- > > Key: SPARK-32105 > URL: https://issues.apache.org/jira/browse/SPARK-32105 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major >
[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-30985: Description: SPARK_CONF_DIR hosts configuration files like, 1) spark-defaults.conf - containing all the spark properties. 2) log4j.properties - Logger configuration. 3) spark-env.sh - Environment variables to be set up at the driver and executor. 4) core-site.xml - Hadoop-related configuration. 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. 6) metrics.properties - Spark metrics. 7) Any user-specific, library- or framework-specific configuration file. Traditionally, SPARK_CONF_DIR has been the home to all user-specific configuration files. So this feature will let user-specific configuration files be mounted on the driver and executor pods' SPARK_CONF_DIR. Please review the attached design doc for more details. [Google docs link|https://bit.ly/spark-30985] was: SPARK_CONF_DIR hosts configuration files like, 1) spark-defaults.conf - containing all the spark properties. 2) log4j.properties - Logger configuration. 3) spark-env.sh - Environment variables to be set up at the driver and executor. 4) core-site.xml - Hadoop-related configuration. 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. 6) metrics.properties - Spark metrics. 7) Any user-specific, library- or framework-specific configuration file. Traditionally, SPARK_CONF_DIR has been the home to all user-specific configuration files. So this feature will let user-specific configuration files be mounted on the driver and executor pods' SPARK_CONF_DIR. Please review the attached design doc for more details. [https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing] > Propagate SPARK_CONF_DIR files to driver and exec pods.
> --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be set up at the driver and executor. > 4) core-site.xml - Hadoop-related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user-specific, library- or framework-specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user-specific > configuration files. > So this feature will let user-specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc for more details. > > [Google docs link|https://bit.ly/spark-30985] >
[jira] [Commented] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel
[ https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156548#comment-17156548 ] angerszhu commented on SPARK-32289: --- [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058] > Chinese characters are garbled when opening csv files with Excel > > > Key: SPARK-32289 > URL: https://issues.apache.org/jira/browse/SPARK-32289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: garbled.png > > > How to reproduce this issue: > {code:scala} > spark.sql("SELECT '我爱中文' AS chinese").write.option("header", > "true").csv("/tmp/spark/csv") > {code} > !garbled.png!
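The linked JDK bug report concerns Excel's reliance on a byte-order mark: a plain UTF-8 CSV (which is what Spark's CSV writer emits) carries no BOM, so Excel falls back to a legacy codepage and the Chinese text is garbled. A hedged illustration of the difference in plain Python (post-processing the file with a BOM prefix is one common workaround; this is not Spark's own API):

```python
# Plain UTF-8 bytes vs BOM-prefixed UTF-8 bytes: the payload is identical,
# only the three-byte prefix Excel uses for encoding detection differs.
import codecs

row = "我爱中文"
plain = row.encode("utf-8")           # what a BOM-less CSV writer produces
with_bom = codecs.BOM_UTF8 + plain    # bytes Excel auto-detects as UTF-8

assert not plain.startswith(codecs.BOM_UTF8)
assert with_bom.startswith(b"\xef\xbb\xbf")
assert with_bom[3:] == plain          # content unchanged, prefix added
```

Prepending these three bytes to the generated CSV file (or importing it with an explicit UTF-8 encoding in Excel) avoids the garbling without changing the data.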
[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32291: Attachment: coalesce.png > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png, coalesce.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is:
[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32291: Description: How to reproduce this issue: {code:scala} spark.range(100).createTempView("t1") spark.range(200).createTempView("t2") spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = t2.id)").show {code} The dag is: !COALESCE.png! A real case: !coalesce.png! !repartition.png! was: How to reproduce this issue: {code:scala} spark.range(100).createTempView("t1") spark.range(200).createTempView("t2") spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = t2.id)").show {code} The dag is: > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png, coalesce.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is: > !COALESCE.png! > A real case: > !coalesce.png! > !repartition.png!
[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32291: Attachment: repartition.png > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png, repartition.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is:
[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
[ https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32291: Attachment: COALESCE.png > COALESCE should not reduce the child parallelism if it is Join > -- > > Key: SPARK-32291 > URL: https://issues.apache.org/jira/browse/SPARK-32291 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: COALESCE.png > > > How to reproduce this issue: > {code:scala} > spark.range(100).createTempView("t1") > spark.range(200).createTempView("t2") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = > t2.id)").show > {code} > The dag is:
[jira] [Created] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join
Yuming Wang created SPARK-32291: --- Summary: COALESCE should not reduce the child parallelism if it is Join Key: SPARK-32291 URL: https://issues.apache.org/jira/browse/SPARK-32291 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Attachments: COALESCE.png How to reproduce this issue: {code:scala} spark.range(100).createTempView("t1") spark.range(200).createTempView("t2") spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = t2.id)").show {code} The dag is:
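The reason the hint throttles the join in the repro above is that COALESCE is a narrow dependency: the merged partition layout is also the layout the whole stage, join included, executes with, so with COALESCE(1) a single task inherits every parent partition. A toy model of that partition merging (illustrative Python, not Spark internals):

```python
# Toy model of COALESCE's partition merging: parent partitions are grouped
# into `target` buckets without a shuffle, so everything upstream in the
# same stage runs with only `target`-way parallelism.
def coalesce(parent_partitions, target):
    buckets = [[] for _ in range(target)]
    for i, part in enumerate(parent_partitions):
        buckets[i % target].append(part)
    return buckets

# 200 join output partitions squeezed into 1: one task carries all the work
merged = coalesce(list(range(200)), 1)
assert len(merged) == 1
assert len(merged[0]) == 200
```

A REPARTITION hint, by contrast, inserts a shuffle boundary, so the join still runs with its own parallelism and only the post-shuffle write is single-partition.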
[jira] [Commented] (SPARK-32220) Cartesian Product Hint cause data error
[ https://issues.apache.org/jira/browse/SPARK-32220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156540#comment-17156540 ] Apache Spark commented on SPARK-32220: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29084 > Cartesian Product Hint cause data error > --- > > Key: SPARK-32220 > URL: https://issues.apache.org/jira/browse/SPARK-32220 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Blocker > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > {code:java} > spark-sql> select * from test4 order by a asc; > 1 2 > Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO > SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s) > spark-sql>select * from test5 order by a asc > 1 2 > 2 2 > Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO > SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar > spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 > where test4.a = test5.a order by test4.a asc ; > 1 2 1 2 > 1 2 2 2 > Time taken: 0.351 seconds, Fetched 2 row(s) > 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched > 2 row(s){code}
[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-30985: Component/s: (was: Spark Core) > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > SPARK_CONF_DIR hosts configuration files like, > 1) spark-defaults.conf - containing all the spark properties. > 2) log4j.properties - Logger configuration. > 3) spark-env.sh - Environment variables to be set up at the driver and executor. > 4) core-site.xml - Hadoop-related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user-specific, library- or framework-specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user-specific > configuration files. > So this feature will let user-specific configuration files be mounted on > the driver and executor pods' SPARK_CONF_DIR. > Please review the attached design doc for more details. > > [https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing] >
[jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize
[ https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32290: Fix Version/s: 3.0.1 > NotInSubquery SingleColumn Optimize > --- > > Key: SPARK-32290 > URL: https://issues.apache.org/jira/browse/SPARK-32290 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Normally, > a NOT IN subquery will be planned as a BroadcastNestedLoopJoinExec, which is very > time-consuming. For example, in a TPCH benchmark I ran lately, Query 16 > took almost half of the entire 22-query TPCH execution time. So I proposed > the following optimization. > Inside BroadcastNestedLoopJoinExec, we can identify a NOT IN subquery with only a > single column by the following pattern. > {code:java} > case _@Or( > _@EqualTo(leftAttr: AttributeReference, rightAttr: > AttributeReference), > _@IsNull( > _@EqualTo(_: AttributeReference, _: AttributeReference) > ) > ) > {code} > If the build side is small enough, we can collect the build-side data into a > HashMap, > so the M*N calculation can be optimized into M*log(N). > I've done a benchmark job on 1TB TPCH: before applying the optimization, > Query 16 took around 18 minutes to finish; after applying the M*log(N) optimization, it > takes only 30s to finish. > But this optimization only works on single-column NOT IN subqueries, so I am here > to seek advice on whether the community needs this update or not. I will do the > pull request first; if the community members think it's a hack, it's fine to > just ignore this request.
[jira] [Created] (SPARK-32290) NotInSubquery SingleColumn Optimize
Leanken.Lin created SPARK-32290: --- Summary: NotInSubquery SingleColumn Optimize Key: SPARK-32290 URL: https://issues.apache.org/jira/browse/SPARK-32290 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Fix For: 3.1.0 Normally, a NOT IN subquery will be planned as a BroadcastNestedLoopJoinExec, which is very time-consuming. For example, in a TPCH benchmark I ran lately, Query 16 took almost half of the entire 22-query TPCH execution time. So I proposed the following optimization. Inside BroadcastNestedLoopJoinExec, we can identify a NOT IN subquery with only a single column by the following pattern. {code:java} case _@Or( _@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference), _@IsNull( _@EqualTo(_: AttributeReference, _: AttributeReference) ) ) {code} If the build side is small enough, we can collect the build-side data into a HashMap, so the M*N calculation can be optimized into M*log(N). I've done a benchmark job on 1TB TPCH: before applying the optimization, Query 16 took around 18 minutes to finish; after applying the M*log(N) optimization, it takes only 30s to finish. But this optimization only works on single-column NOT IN subqueries, so I am here to seek advice on whether the community needs this update or not. I will do the pull request first; if the community members think it's a hack, it's fine to just ignore this request.
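The proposal above can be sketched as two semantically equivalent evaluations of a single-column NOT IN: the per-row scan of the build side that a nested-loop join effectively performs, and the hashed form that replaces the inner loop with a set lookup. This is a toy Python model; the names are illustrative, not Spark internals:

```python
# Toy model of the single-column NOT IN optimization: O(M*N) nested loop
# vs O(M + N) hash lookup, with the same null-aware semantics (a NULL on
# either side makes the predicate "unknown", so the row is filtered out).

def not_in_nested_loop(stream, build):
    # A row survives only if it matches no build row under the
    # Or(EqualTo, IsNull(EqualTo)) condition described above.
    out = []
    for s in stream:
        if s is None or any(b is None or b == s for b in build):
            continue
        out.append(s)
    return out

def not_in_hashed(stream, build):
    # Same semantics, one hash set instead of an inner loop per row.
    if any(b is None for b in build):
        return []  # NOT IN against a set containing NULL matches nothing
    keys = set(build)
    return [s for s in stream if s is not None and s not in keys]
```

Both functions agree on the NULL edge cases, which is the tricky part of NOT IN: `not_in_hashed([1, 2, 3, None], [2, 4])` gives `[1, 3]`, and any NULL on the build side empties the result, matching SQL's three-valued logic.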