[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2020-07-13 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157180#comment-17157180
 ] 

Cheng Su commented on SPARK-24528:
--

+1 for [~viirya]'s suggestion. I think some changes in FileScanRDD and 
FileSourceScanExec should be enough to preserve the ordering property when 
reading sorted bucketed files (for the non-vectorized code path). We should 
enable this feature selectively, though, because each task needs to keep a 
current row in task memory for every bucket file it merges, so we need to be 
careful not to merge too many bucket files and cause an OOM in the task. I am 
working on a PR now.
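
A minimal sketch of this kind of per-task merge, assuming nothing about Spark 
internals (plain iterators stand in for the sorted bucket-file readers, and 
{{mergeSortedFiles}} is a hypothetical helper, not an existing API):

{code:scala}
// Hedged sketch only: merge several files of the same bucket, each already
// sorted on the sort key, while keeping just the current row of every file in
// task memory. Plain iterators stand in for the file readers.
import scala.collection.mutable

def mergeSortedFiles[T](files: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
  new Iterator[T] {
    // One buffered iterator per file: its head is the "current row" kept in memory.
    private val heap = mutable.PriorityQueue(files.map(_.buffered).filter(_.hasNext): _*)(
      Ordering.by[BufferedIterator[T], T](_.head).reverse)

    def hasNext: Boolean = heap.nonEmpty
    def next(): T = {
      val it = heap.dequeue()
      val row = it.next()
      if (it.hasNext) heap.enqueue(it) // re-insert with its new current row
      row
    }
  }

// Example: two sorted "files" merge into one globally sorted stream.
// mergeSortedFiles(Seq(Iterator(1, 3, 5), Iterator(2, 4, 6))).toList == List(1, 2, 3, 4, 5, 6)
{code}

The memory held per task grows with the number of files being merged, which is 
exactly why such a feature would need to be enabled selectively.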

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>     "id as key",
>     "id % 2 as t1",
>     "id % 3 as t2")
>   .repartition(col("key"))
>   .write
>   .mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("key", "t1")
>   .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
>     files.map(_.getPath.getName).groupBy(file => BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?






[jira] [Commented] (SPARK-32298) tree models prediction optimization

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157144#comment-17157144
 ] 

Apache Spark commented on SPARK-32298:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29095

> tree models prediction optimization
> ---
>
> Key: SPARK-32298
> URL: https://issues.apache.org/jira/browse/SPARK-32298
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In {{Node}}'s method
>  
> {{def predictImpl(features: Vector): LeafNode}}
>  
> use a while loop instead of recursion.






[jira] [Assigned] (SPARK-32298) tree models prediction optimization

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32298:


Assignee: Apache Spark

> tree models prediction optimization
> ---
>
> Key: SPARK-32298
> URL: https://issues.apache.org/jira/browse/SPARK-32298
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> In {{Node}}'s method
>  
> {{def predictImpl(features: Vector): LeafNode}}
>  
> use a while loop instead of recursion.






[jira] [Assigned] (SPARK-32298) tree models prediction optimization

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32298:


Assignee: (was: Apache Spark)

> tree models prediction optimization
> ---
>
> Key: SPARK-32298
> URL: https://issues.apache.org/jira/browse/SPARK-32298
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In {{Node}}'s method
>  
> {{def predictImpl(features: Vector): LeafNode}}
>  
> use a while loop instead of recursion.






[jira] [Created] (SPARK-32298) tree models prediction optimization

2020-07-13 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-32298:


 Summary: tree models prediction optimization
 Key: SPARK-32298
 URL: https://issues.apache.org/jira/browse/SPARK-32298
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


In {{Node}}'s method

{{def predictImpl(features: Vector): LeafNode}}

use a while loop instead of recursion.
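
A minimal sketch of the proposed change, with simplified stand-in types rather 
than the real {{ml.tree}} classes ({{Node}}, {{LeafNode}}, {{InternalNode}} and 
{{shouldGoLeft}} below are illustrative only):

{code:scala}
// Illustrative only: a simplified tree with an iterative predictImpl that
// replaces recursion with a while loop, so deep trees do not grow the call stack.
sealed trait Node
final case class LeafNode(prediction: Double) extends Node
final case class InternalNode(featureIndex: Int, threshold: Double,
                              left: Node, right: Node) extends Node {
  def shouldGoLeft(features: Array[Double]): Boolean =
    features(featureIndex) <= threshold
}

def predictImpl(root: Node, features: Array[Double]): LeafNode = {
  var node = root
  // Descend from the root until a leaf is reached.
  while (!node.isInstanceOf[LeafNode]) {
    val internal = node.asInstanceOf[InternalNode]
    node = if (internal.shouldGoLeft(features)) internal.left else internal.right
  }
  node.asInstanceOf[LeafNode]
}
{code}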






[jira] [Assigned] (SPARK-32241) Remove empty children of union

2020-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32241:
---

Assignee: Peter Toth

> Remove empty children of union
> --
>
> Key: SPARK-32241
> URL: https://issues.apache.org/jira/browse/SPARK-32241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Minor
>
> Empty relation children of a union can be removed.
> e.g. the plan of
> {noformat}
> SELECT c FROM t UNION ALL SELECT c FROM t WHERE FALSE{noformat}
> is currently:
> {noformat}
> == Physical Plan ==
> Union
> :- *(1) Project [value#219 AS c#222]
> :  +- *(1) LocalTableScan [value#219]
> +- LocalTableScan , [c#224]{noformat}
> but it could be improved as: 
> {noformat}
> == Physical Plan ==
> *(1) Project [value#219 AS c#222]
> +- *(1) LocalTableScan [value#219]{noformat}
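
A sketch of the idea only, using toy plan nodes rather than the actual Catalyst 
classes ({{Plan}}, {{Union}} and {{LocalRelation}} below are simplified 
stand-ins):

{code:scala}
// Illustrative only: prune known-empty children from a Union and collapse the
// Union when zero or one child remains.
sealed trait Plan { def isEmpty: Boolean = false }
final case class LocalRelation(rows: Seq[Seq[Any]]) extends Plan {
  override def isEmpty: Boolean = rows.isEmpty
}
final case class Union(children: Seq[Plan]) extends Plan

def pruneEmptyUnionChildren(plan: Plan): Plan = plan match {
  case Union(children) =>
    children.filterNot(_.isEmpty) match {
      case Seq()       => LocalRelation(Nil)   // all children were empty
      case Seq(single) => single               // a single child replaces the Union
      case remaining   => Union(remaining)
    }
  case other => other
}
{code}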






[jira] [Resolved] (SPARK-32241) Remove empty children of union

2020-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32241.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29053
[https://github.com/apache/spark/pull/29053]

> Remove empty children of union
> --
>
> Key: SPARK-32241
> URL: https://issues.apache.org/jira/browse/SPARK-32241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Minor
> Fix For: 3.1.0
>
>
> Empty relation children of a union can be removed.
> e.g. the plan of
> {noformat}
> SELECT c FROM t UNION ALL SELECT c FROM t WHERE FALSE{noformat}
> is currently:
> {noformat}
> == Physical Plan ==
> Union
> :- *(1) Project [value#219 AS c#222]
> :  +- *(1) LocalTableScan [value#219]
> +- LocalTableScan , [c#224]{noformat}
> but it could be improved as: 
> {noformat}
> == Physical Plan ==
> *(1) Project [value#219 AS c#222]
> +- *(1) LocalTableScan [value#219]{noformat}






[jira] [Commented] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157132#comment-17157132
 ] 

Apache Spark commented on SPARK-24983:
--

User 'constzhou' has created a pull request for this issue:
https://github.com/apache/spark/pull/29094

> Collapsing multiple project statements with dependent When-Otherwise 
> statements on the same column can OOM the driver
> -
>
> Key: SPARK-24983
> URL: https://issues.apache.org/jira/browse/SPARK-24983
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed that writing a Spark job that includes many sequential 
> {{when-otherwise}} statements on the same column can easily OOM the driver 
> while generating the optimized plan because the project node will grow 
> exponentially in size.
> Example:
> {noformat}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> val df = Seq("a", "b", "c", "1").toDF("text")
> df: org.apache.spark.sql.DataFrame = [text: string]
> scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
> dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: 
> string]
> scala> for( a <- 1 to 5) {
>  | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === 
> lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>  | }
> scala> dfCaseWhen.queryExecution.analyzed
> res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
> +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
>+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
>   +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
>  +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
> +- Filter NOT (text#3 = 0)
>+- Project [value#1 AS text#3]
>   +- LocalRelation [value#1]
> scala> dfCaseWhen.queryExecution.optimizedPlan
> res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE 
> WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va...
> {noformat}
> As one can see, the optimized plan grows exponentially with the number of 
> {{when-otherwise}} statements here.
> I can see that this comes from the {{CollapseProject}} optimizer rule.
> Maybe we should put a limit on the resulting size of the project node after 
> collapsing and only collapse if we stay under the limit.
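
A sketch of that size-limit idea with toy expression types (illustrative only, 
not the actual Catalyst {{CollapseProject}} rule; {{Expr}}, {{Ref}}, 
{{CaseWhen}} and the size budget are made up for the example):

{code:scala}
// Illustrative only: inline the lower Project's aliases into the upper
// expressions, but skip the collapse when the substituted result would exceed a
// size budget, avoiding the exponential blow-up described above.
sealed trait Expr { def size: Int }
final case class Ref(name: String) extends Expr { val size = 1 }
final case class Lit(value: String) extends Expr { val size = 1 }
final case class CaseWhen(cond: Expr, whenTrue: Expr, otherwise: Expr) extends Expr {
  def size: Int = 1 + cond.size + whenTrue.size + otherwise.size
}

def inline(e: Expr, defs: Map[String, Expr]): Expr = e match {
  case Ref(n)            => defs.getOrElse(n, e)
  case CaseWhen(c, t, f) => CaseWhen(inline(c, defs), inline(t, defs), inline(f, defs))
  case other             => other
}

// Collapse two Projects only when the merged expression stays under maxSize;
// otherwise keep them separate.
def collapseIfSmall(upper: Expr, lowerAliases: Map[String, Expr], maxSize: Int): Option[Expr] = {
  val merged = inline(upper, lowerAliases)
  if (merged.size <= maxSize) Some(merged) else None
}
{code}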






[jira] [Commented] (SPARK-32266) Run smoke tests after a commit is pushed

2020-07-13 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157129#comment-17157129
 ] 

Gengliang Wang commented on SPARK-32266:


[~hyukjin.kwon] Thanks for the update.

> Run smoke tests after a commit is pushed
> 
>
> Key: SPARK-32266
> URL: https://issues.apache.org/jira/browse/SPARK-32266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Run linter/sbt build/maven build/doc generation when a commit is pushed.






[jira] [Commented] (SPARK-31356) Splitting Aggregate node into separate Aggregate and Serialize for Optimizer

2020-07-13 Thread Martin Loncaric (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157128#comment-17157128
 ] 

Martin Loncaric commented on SPARK-31356:
-

Actually, there seem to be 3 separate performance issues:
1. unnecessary appendColumns when the groupByKey function just returns a subset 
of the columns (though this is hard to get around in a type-safe way)
2. unnecessary serialize + deserialize
3. the RDD API is actually roughly 2x faster overall, so there seems to be a 
lot of room to improve aggregations (see the sketch below)
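
For reference, a hedged illustration of points 2 and 3 (toy data; assumes a 
spark-shell with an existing {{spark}} session, and {{Rec}} is a made-up case 
class):

{code:scala}
// Hedged illustration only: the Dataset groupByKey/reduceGroups/map(_._2)
// pattern vs. the equivalent RDD-level aggregation, which has no Dataset
// serialization boundary.
import spark.implicits._

case class Rec(y: String, v: Long)

val ds = Seq(Rec("a", 1L), Rec("a", 2L), Rec("b", 3L)).toDS()

// Dataset API: map(_._2) drops the key after the aggregate and tends to show up
// as SerializeFromObject / DeserializeToObject in the optimized plan.
val viaDataset = ds.groupByKey(_.y)
  .reduceGroups((a, b) => if (a.v >= b.v) a else b)
  .map(_._2)

// RDD API: keyBy + reduceByKey + values, with no extra (de)serialization step.
val viaRdd = ds.rdd
  .keyBy(_.y)
  .reduceByKey((a, b) => if (a.v >= b.v) a else b)
  .values

// Compare viaDataset.explain(true) with viaRdd.toDebugString to inspect the plans.
{code}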

> Splitting Aggregate node into separate Aggregate and Serialize for Optimizer
> 
>
> Key: SPARK-31356
> URL: https://issues.apache.org/jira/browse/SPARK-31356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Problem: in Datasets API, it is a very common pattern to do something like 
> this whenever a complex reduce function is needed:
> {code:scala}
> ds
>   .groupByKey(_.y)
>   .reduceGroups((a, b) => {...})
>   .map(_._2)
> {code}
> However, the .map(_._2) step (taking values and throwing keys away) 
> unfortunately often ends up as an unnecessary serialization during the 
> aggregation step, followed by {{DeserializeToObject + MapElements (from (K, 
> V) => V) + SerializeFromObject}} in the optimized logical plan. In this 
> example, it would be more ideal to either skip the 
> deserialization/serialization or {{Project (from (K, V) => V)}}. Even 
> manually doing a {{.select(...).as[T]}} to replace the `.map` is quite 
> tricky, because
> * the columns are complicated, like {{[value, 
> ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
> * it breaks the nice type checking of Datasets
> Proposal:
> Change the {{KeyValueGroupedDataset.aggUntyped}} method to (like 
> {{KeyValueGroupedDataset.cogroup}}) append both an {{Aggregate}} node and 
> a {{SerializeFromObject}} node, so that the Optimizer can eliminate the 
> serialization when it is redundant. Change aggregations to emit deserialized 
> results.
> I had 2 ideas for what we could change: either add a new feature to 
> {{.reduceGroupValues}} that projects to only the necessary columns, or do 
> this improvement. I thought this would be a better solution because
> * it will improve the performance of existing Spark applications with no 
> modifications
> * feature growth is undesirable
> Uncertainties:
> Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
> 3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
> Complications: Are there any hazards in splitting Aggregation into 
> Aggregation + SerializeFromObject that I'm not aware of?






[jira] [Comment Edited] (SPARK-32253) Make readability better in the test result logs

2020-07-13 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157120#comment-17157120
 ] 

L. C. Hsieh edited comment on SPARK-32253 at 7/14/20, 3:19 AM:
---

Looks interesting. Will do some tests. :)


was (Author: viirya):
Will do some tests. :)

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the readability of the logs is not really good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.






[jira] [Comment Edited] (SPARK-32253) Make readability better in the test result logs

2020-07-13 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157120#comment-17157120
 ] 

L. C. Hsieh edited comment on SPARK-32253 at 7/14/20, 3:18 AM:
---

Will do some tests. :)


was (Author: viirya):
Will do some tests.

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the readability of the logs is not really good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.






[jira] [Commented] (SPARK-32253) Make readability better in the test result logs

2020-07-13 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157120#comment-17157120
 ] 

L. C. Hsieh commented on SPARK-32253:
-

Will do some tests.

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the readability of the logs is not really good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.






[jira] [Commented] (SPARK-32264) More resources in Github Actions

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157100#comment-17157100
 ] 

Hyukjin Kwon commented on SPARK-32264:
--

This is in progress on the private mailing list.

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We are currently using the free version of GitHub Actions, which only allows 20 
> concurrent jobs. This is not enough for the heavy development activity in Apache Spark.
> We should have a way to allocate more resources.






[jira] [Resolved] (SPARK-32266) Run smoke tests after a commit is pushed

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32266.
--
  Assignee: Dongjoon Hyun
Resolution: Fixed

> Run smoke tests after a commit is pushed
> 
>
> Key: SPARK-32266
> URL: https://issues.apache.org/jira/browse/SPARK-32266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Run linter/sbt build/maven build/doc generation when a commit is pushed.






[jira] [Updated] (SPARK-32266) Run smoke tests after a commit is pushed

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32266:
-
Fix Version/s: 3.1.0

> Run smoke tests after a commit is pushed
> 
>
> Key: SPARK-32266
> URL: https://issues.apache.org/jira/browse/SPARK-32266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Run linter/sbt build/maven build/doc generation when a commit is pushed.






[jira] [Commented] (SPARK-32266) Run smoke tests after a commit is pushed

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157098#comment-17157098
 ] 

Hyukjin Kwon commented on SPARK-32266:
--

This was fixed in https://github.com/apache/spark/pull/29076

> Run smoke tests after a commit is pushed
> 
>
> Key: SPARK-32266
> URL: https://issues.apache.org/jira/browse/SPARK-32266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Run linter/sbt build/maven build/doc generation when a commit is pushed.






[jira] [Commented] (SPARK-32253) Make readability better in the test result logs

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157097#comment-17157097
 ] 

Hyukjin Kwon commented on SPARK-32253:
--

[~Gengliang.Wang] or probably [~viirya] from the watchers :-). Are you guys 
interested in this?
Testing it is pretty easy: you can just make a branch as usual but open a PR 
against your own master to test it.
That will automatically trigger the GitHub Actions build using your account 
in your forked Spark repo.

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the readability of the logs is not really good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.






[jira] [Commented] (SPARK-32296) Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157086#comment-17157086
 ] 

Hyukjin Kwon commented on SPARK-32296:
--

cc [~jiangxb1987] FYI

> Flaky Test: submit a barrier ResultStage that requires more slots than 
> current total under local-cluster mode
> -
>
> Key: SPARK-32296
> URL: https://issues.apache.org/jira/browse/SPARK-32296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> 2020-07-13T21:39:28.3795362Z [info] - submit a barrier 
> ResultStage that requires more slots than current total under local-cluster 
> mode *** FAILED *** (5 seconds, 703 milliseconds)
> 2020-07-13T21:39:28.3843780Z [info]   Expected exception 
> org.apache.spark.SparkException to be thrown, but 
> java.util.concurrent.TimeoutException was thrown 
> (BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.3844344Z [info]   
> org.scalatest.exceptions.TestFailedException:
> 2020-07-13T21:39:28.4058689Z [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> 2020-07-13T21:39:28.4059209Z [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> 2020-07-13T21:39:28.4175876Z [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4176563Z [info]   at 
> org.scalatest.Assertions.intercept(Assertions.scala:814)
> 2020-07-13T21:39:28.4176967Z [info]   at 
> org.scalatest.Assertions.intercept$(Assertions.scala:804)
> 2020-07-13T21:39:28.4177353Z [info]   at 
> org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4177794Z [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite.testSubmitJob(BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.4178272Z [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite.$anonfun$new$35(BarrierStageOnSubmittedSuite.scala:240)
> 2020-07-13T21:39:28.4178695Z [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2020-07-13T21:39:28.4179081Z [info]   at 
> org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 2020-07-13T21:39:28.4179731Z [info]   at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 2020-07-13T21:39:28.4180162Z [info]   at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 2020-07-13T21:39:28.4180550Z [info]   at 
> org.scalatest.Transformer.apply(Transformer.scala:22)
> 2020-07-13T21:39:28.4180929Z [info]   at 
> org.scalatest.Transformer.apply(Transformer.scala:20)
> 2020-07-13T21:39:28.4181323Z [info]   at 
> org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 2020-07-13T21:39:28.4181728Z [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> 2020-07-13T21:39:28.4223205Z [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> 2020-07-13T21:39:28.4223689Z [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224119Z [info]   at 
> org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> 2020-07-13T21:39:28.4224510Z [info]   at 
> org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224901Z [info]   at 
> org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> 2020-07-13T21:39:28.4225362Z [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4225778Z [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> 2020-07-13T21:39:28.4226188Z [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> 2020-07-13T21:39:28.4226589Z [info]   at 
> org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4226997Z [0

[jira] [Created] (SPARK-32297) Flaky Test: YarnClusterSuite 4 test cases

2020-07-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32297:


 Summary: Flaky Test: YarnClusterSuite 4 test cases
 Key: SPARK-32297
 URL: https://issues.apache.org/jira/browse/SPARK-32297
 Project: Spark
  Issue Type: Sub-task
  Components: Tests, YARN
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


{code}
2020-07-13T20:04:30.9911637Z [info] - run Spark in 
yarn-client mode with different configurations, ensuring redaction *** FAILED 
*** (3 minutes, 0 seconds)
2020-07-13T20:04:30.9912398Z [info]   The code passed to 
eventually never returned normally. Attempted 190 times over 3.001191441868 
minutes. Last failure message: handle.getState().isFinal() was false. 
(BaseYarnClusterSuite.scala:170)
2020-07-13T20:04:30.9931230Z [info]   
org.scalatest.exceptions.TestFailedDueToTimeoutException:
2020-07-13T20:04:30.9932756Z [info]   at 
org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
2020-07-13T20:04:30.9933210Z [info]   at 
org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
2020-07-13T20:04:30.9933633Z [info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
2020-07-13T20:04:30.9934024Z [info]   at 
org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
2020-07-13T20:04:30.9934430Z [info]   at 
org.scalatest.concurrent.Eventually.eventually(Eventually.scala:308)
2020-07-13T20:04:30.9934824Z [info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:307)
2020-07-13T20:04:30.9935218Z [info]   at 
org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
2020-07-13T20:04:30.9935655Z [info]   at 
org.apache.spark.deploy.yarn.BaseYarnClusterSuite.runSpark(BaseYarnClusterSuite.scala:170)
2020-07-13T20:04:31.0012081Z [info]   at 
org.apache.spark.deploy.yarn.YarnClusterSuite.testBasicYarnApp(YarnClusterSuite.scala:243)
2020-07-13T20:04:31.0013838Z [info]   at 
org.apache.spark.deploy.yarn.YarnClusterSuite.$anonfun$new$4(YarnClusterSuite.scala:104)
2020-07-13T20:04:31.0015078Z [info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2020-07-13T20:04:31.0015899Z [info]   at 
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2020-07-13T20:04:31.0016423Z [info]   at 
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2020-07-13T20:04:31.0016952Z [info]   at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2020-07-13T20:04:31.0017479Z [info]   at 
org.scalatest.Transformer.apply(Transformer.scala:22)
2020-07-13T20:04:31.0018599Z [info]   at 
org.scalatest.Transformer.apply(Transformer.scala:20)
2020-07-13T20:04:31.0019144Z [info]   at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
2020-07-13T20:04:31.0019692Z [info]   at 
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
2020-07-13T20:04:31.0020230Z [info]   at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
2020-07-13T20:04:31.0020789Z [info]   at 
org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
2020-07-13T20:04:31.0021285Z [info]   at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
2020-07-13T20:04:31.0021826Z [info]   at 
org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
2020-07-13T20:04:31.0022361Z [info]   at 
org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
2020-07-13T20:04:31.0022913Z [info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
2020-07-13T20:04:31.0023470Z [info]   at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
2020-07-13T20:04:31.0024015Z [info]   at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
2020-07-13T20:04:31.0024534Z [info]   at 
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
2020-07-13T20:04:31.0025078Z [info]   at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
2020-07-13T20:04:31.0025606Z [info]   at 
org.scalatest.SuperEng

[jira] [Resolved] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32138.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28957
[https://github.com/apache/spark/pull/28957]

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Assigned] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32138:


Assignee: Hyukjin Kwon

> Drop Python 2, 3.4 and 3.5 in codes and documentation
> -
>
> Key: SPARK-32138
> URL: https://issues.apache.org/jira/browse/SPARK-32138
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>







[jira] [Commented] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157080#comment-17157080
 ] 

Hyukjin Kwon commented on SPARK-32278:
--

Oh, yeah. I noticed this and forgot to take action on this JIRA. Sorry for 
the false alarm - this JIRA can be resolved.

> Install PyPy3 on Jenkins to enable PySpark tests with PyPy
> --
>
> Key: SPARK-32278
> URL: https://issues.apache.org/jira/browse/SPARK-32278
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> The PyPy currently installed on Jenkins is too old and only Python 2 compatible. 
> Python 2 will be dropped in SPARK-32138, so we should now upgrade to the 
> Python 3 compatible PyPy3.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160






[jira] [Resolved] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32278.
--
Resolution: Not A Problem

> Install PyPy3 on Jenkins to enable PySpark tests with PyPy
> --
>
> Key: SPARK-32278
> URL: https://issues.apache.org/jira/browse/SPARK-32278
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> The PyPy currently installed on Jenkins is too old and only Python 2 compatible. 
> Python 2 will be dropped in SPARK-32138, so we should now upgrade to the 
> Python 3 compatible PyPy3.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160






[jira] [Commented] (SPARK-32279) Install Sphinx in Python 3 on Jenkins machines

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157078#comment-17157078
 ] 

Hyukjin Kwon commented on SPARK-32279:
--

I believe any version is fine. Probably the latest one :-).

> Install Sphinx in Python 3 on Jenkins machines
> --
>
> Key: SPARK-32279
> URL: https://issues.apache.org/jira/browse/SPARK-32279
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Currently Sphinx is only installed for Python 2. We should install it for 
> Python 3 and test it on Jenkins, as Python 2, 3.4 and 3.5 were dropped in 
> SPARK-32138.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-ccd847a0316575dde31bd89786bbe1f2R176
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/dev/lint-python#L176






[jira] [Resolved] (SPARK-32146) ValueError when loading a PipelineModel on a personal computer

2020-07-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-32146.
--
Resolution: Invalid

Please use the user mailing list for questions. If your issue is bound to a 
specific vendor, please go through that vendor's support channel.

> ValueError when loading a PipelineModel on a personal computer
> --
>
> Key: SPARK-32146
> URL: https://issues.apache.org/jira/browse/SPARK-32146
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.5
> Environment: * OS: Windows
>  * SparkSession: spark = 
> SparkSession.builder.appName("annonces_organiques").getOrCreate()
>Reporter: LoicH
>Priority: Major
>
> I have a PipelineModel saved on my computer that I can't load using 
> {{PipelineModel.load(path)}}.
> When I launch my code in a Databricks cluster, it works. {{path}} is the path 
> to my model saved on DBFS, accessible via a mount point: {{path = 
> "/dbfs/path/to/my/model}}.
> However on my machine, calling 
> {{PipelineModel.load("C:\\Users\\path\\to\\my\\model")}} throws a 
> {{ValueError("RDD is empty")}}.
> Here is how the model is saved on my computer:
> {code:title=pipeline.txt}
> \---model
> +---metadata
> |   part-0
> |   _SUCCESS
> |
> \---stages
> +---0_CountVectorizer_b92625354bf7
> |   +---data
> |   |   
> part-0-tid-9156766819779394023-5cf6aecb-8959-48b3-be24-65bfa0543465-62-1-c000.snappy.parquet
> |   |   _committed_9156766819779394023
> |   |   _started_9156766819779394023
> |   |   _SUCCESS
> |   |
> |   \---metadata
> |   part-0
> |   _SUCCESS
> |
> \---1_LinearSVC_108fa01daf43
> +---data
> |   
> part-0-tid-4403060754466700849-27841dd9-de88-4015-9dfa-7854c2a15f15-65-1-c000.snappy.parquet
> |   _committed_4403060754466700849
> |   _started_4403060754466700849
> |   _SUCCESS
> |
> \---metadata
> part-0
> _SUCCESS
> {code}
> (I just downloaded the model from my DataLake to my computer)
> How can I load this model when running my code locally?






[jira] [Updated] (SPARK-32146) ValueError when loading a PipelineModel on a personal computer

2020-07-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32146:
-
Priority: Major  (was: Blocker)

> ValueError when loading a PipelineModel on a personal computer
> --
>
> Key: SPARK-32146
> URL: https://issues.apache.org/jira/browse/SPARK-32146
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.5
> Environment: * OS: Windows
>  * SparkSession: spark = 
> SparkSession.builder.appName("annonces_organiques").getOrCreate()
>Reporter: LoicH
>Priority: Major
>
> I have a PipelineModel saved on my computer that I can't load using 
> {{PipelineModel.load(path)}}.
> When I launch my code in a Databricks cluster, it works. {{path}} is the path 
> to my model saved on DBFS, accessible via a mount point: {{path = 
> "/dbfs/path/to/my/model}}.
> However on my machine, calling 
> {{PipelineModel.load("C:\\Users\\path\\to\\my\\model")}} throws a 
> {{ValueError("RDD is empty")}}.
> Here is how the model is saved on my computer:
> {code:title=pipeline.txt}
> \---model
> +---metadata
> |   part-0
> |   _SUCCESS
> |
> \---stages
> +---0_CountVectorizer_b92625354bf7
> |   +---data
> |   |   
> part-0-tid-9156766819779394023-5cf6aecb-8959-48b3-be24-65bfa0543465-62-1-c000.snappy.parquet
> |   |   _committed_9156766819779394023
> |   |   _started_9156766819779394023
> |   |   _SUCCESS
> |   |
> |   \---metadata
> |   part-0
> |   _SUCCESS
> |
> \---1_LinearSVC_108fa01daf43
> +---data
> |   
> part-0-tid-4403060754466700849-27841dd9-de88-4015-9dfa-7854c2a15f15-65-1-c000.snappy.parquet
> |   _committed_4403060754466700849
> |   _started_4403060754466700849
> |   _SUCCESS
> |
> \---metadata
> part-0
> _SUCCESS
> {code}
> (I just downloaded the model from my DataLake to my computer)
> How can I load this model when running my code locally?






[jira] [Commented] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s

2020-07-13 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157069#comment-17157069
 ] 

Jungtaek Lim commented on SPARK-32259:
--

Lowering the priority, as Critical+ requires a committer's judgement.

> tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
> ---
>
> Key: SPARK-32259
> URL: https://issues.apache.org/jira/browse/SPARK-32259
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prakash Rajendran
>Priority: Major
> Attachments: Capture.PNG
>
>
> In spark-submit, I have the config 
> "*spark.kubernetes.local.dirs.tmpfs=true*", but Spark is still not pointing 
> its spill data to the SPARK_LOCAL_DIRS path.
> K8s is evicting the pod with the error "*Pod ephemeral local 
> storage usage exceeds the total limit of containers.*"
>  
> We use Spark launcher to do the spark-submit in k8s. Since the pod is evicted, 
> its logs for the stack trace are not available; we only have the pod events 
> given in the attachment.
>  
>  






[jira] [Commented] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED

2020-07-13 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157070#comment-17157070
 ] 

Jungtaek Lim commented on SPARK-32197:
--

Lowering the priority, as Critical+ requires a committer's judgement.

> 'Spark driver' stays running even though 'spark application' has FAILED
> ---
>
> Key: SPARK-32197
> URL: https://issues.apache.org/jira/browse/SPARK-32197
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
> Attachments: app_executors.png, applog.txt, driverlog.txt, 
> failed1.png, failed_stages.png, failedapp.png, j1.out, stuckdriver.png
>
>
> The app failed in 6 minutes, but the driver has been stuck for > 8 hours. I 
> would expect the driver to fail if the app fails.
>  
> Thread dump from jstack (on the driver pid) attached (j1.out)
> Last part of stdout driver log attached (full log is 23MB, stderr log just 
> has launch command)
> Last part of app logs attached
>  
> I can see that the "org.apache.spark.util.ShutdownHookManager - Shutdown hook 
> called" line never appears in the driver log after 
> "org.apache.spark.SparkContext - Successfully stopped SparkContext".
>  
> Using spark 2.4.6 with spark standalone mode. spark-submit to REST API (port 
> 6066) in cluster mode was used. Other drivers/apps have worked fine with this 
> setup, just this one getting stuck. My cluster has 1 EC2 dedicated as spark 
> master and 1 Spot EC2 dedicated as spark worker. They can auto heal/spot 
> terminate at any time. From checking aws logs: the worker was terminated at 
> 01:53:38
>  
> I think you can replicate this by tearing down the worker machine while the 
> app is running. You might have to try several times.
>  
> Similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before!
>  






[jira] [Updated] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED

2020-07-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32197:
-
Priority: Major  (was: Blocker)

> 'Spark driver' stays running even though 'spark application' has FAILED
> ---
>
> Key: SPARK-32197
> URL: https://issues.apache.org/jira/browse/SPARK-32197
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
> Attachments: app_executors.png, applog.txt, driverlog.txt, 
> failed1.png, failed_stages.png, failedapp.png, j1.out, stuckdriver.png
>
>
> The app failed in 6 minutes, but the driver has been stuck for > 8 hours. I 
> would expect the driver to fail if the app fails.
>  
> Thread dump from jstack (on the driver pid) attached (j1.out)
> Last part of stdout driver log attached (full log is 23MB, stderr log just 
> has launch command)
> Last part of app logs attached
>  
> I can see that the "org.apache.spark.util.ShutdownHookManager - Shutdown hook 
> called" line never appears in the driver log after 
> "org.apache.spark.SparkContext - Successfully stopped SparkContext".
>  
> Using spark 2.4.6 with spark standalone mode. spark-submit to REST API (port 
> 6066) in cluster mode was used. Other drivers/apps have worked fine with this 
> setup, just this one getting stuck. My cluster has 1 EC2 dedicated as spark 
> master and 1 Spot EC2 dedicated as spark worker. They can auto heal/spot 
> terminate at any time. From checking aws logs: the worker was terminated at 
> 01:53:38
>  
> I think you can replicate this by tearing down the worker machine while the 
> app is running. You might have to try several times.
>  
> Similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before!
>  






[jira] [Updated] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s

2020-07-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32259:
-
Priority: Major  (was: Blocker)

> tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
> ---
>
> Key: SPARK-32259
> URL: https://issues.apache.org/jira/browse/SPARK-32259
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prakash Rajendran
>Priority: Major
> Attachments: Capture.PNG
>
>
> In spark-submit, I have the config 
> "*spark.kubernetes.local.dirs.tmpfs=true*", but Spark is still not pointing 
> its spill data to the SPARK_LOCAL_DIRS path.
> K8s is evicting the pod with the error "*Pod ephemeral local 
> storage usage exceeds the total limit of containers.*"
>  
> We use Spark launcher to do the spark-submit in k8s. Since the pod is evicted, 
> its logs for the stack trace are not available; we only have the pod events 
> given in the attachment.
>  
>  






[jira] [Commented] (SPARK-32220) Cartesian Product Hint cause data error

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157066#comment-17157066
 ] 

Apache Spark commented on SPARK-32220:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29093

> Cartesian Product Hint cause data error
> ---
>
> Key: SPARK-32220
> URL: https://issues.apache.org/jira/browse/SPARK-32220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> spark-sql> select * from test4 order by a asc;
> 1 2
> Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO 
> SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s)
> spark-sql>select * from test5 order by a asc
> 1 2
> 2 2
> Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO 
> SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar
> spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 
> where test4.a = test5.a order by test4.a asc ;
> 1 2 1 2
> 1 2 2 2
> Time taken: 0.351 seconds, Fetched 2 row(s)
> 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 
> 2 row(s){code}






[jira] [Commented] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157065#comment-17157065
 ] 

Hyukjin Kwon commented on SPARK-32294:
--

Thanks for filing the issue, [~Tagar].

> GroupedData Pandas UDF 2Gb limit
> 
>
> Key: SPARK-32294
> URL: https://issues.apache.org/jira/browse/SPARK-32294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
> GroupedData: the whole group is passed to the Pandas UDF at once, which can hit 
> various 2Gb limitations on the Arrow side (and, in current versions of Arrow, 
> also a 2Gb limitation on the Netty allocator side) - 
> https://issues.apache.org/jira/browse/ARROW-4890 
> It would be great to consider feeding GroupedData into a pandas UDF in batches 
> to solve this issue. 
> cc [~hyukjin.kwon] 
>  






[jira] [Updated] (SPARK-32296) Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32296:
-
Component/s: Spark Core

> Flaky Test: submit a barrier ResultStage that requires more slots than 
> current total under local-cluster mode
> -
>
> Key: SPARK-32296
> URL: https://issues.apache.org/jira/browse/SPARK-32296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> 2020-07-13T21:39:28.3795362Z [info] - submit a barrier 
> ResultStage that requires more slots than current total under local-cluster 
> mode *** FAILED *** (5 seconds, 703 milliseconds)
> 2020-07-13T21:39:28.3843780Z [info]   Expected exception 
> org.apache.spark.SparkException to be thrown, but 
> java.util.concurrent.TimeoutException was thrown 
> (BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.3844344Z [info]   
> org.scalatest.exceptions.TestFailedException:
> 2020-07-13T21:39:28.4058689Z [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> 2020-07-13T21:39:28.4059209Z [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> 2020-07-13T21:39:28.4175876Z [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4176563Z [info]   at 
> org.scalatest.Assertions.intercept(Assertions.scala:814)
> 2020-07-13T21:39:28.4176967Z [info]   at 
> org.scalatest.Assertions.intercept$(Assertions.scala:804)
> 2020-07-13T21:39:28.4177353Z [info]   at 
> org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> 2020-07-13T21:39:28.4177794Z [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite.testSubmitJob(BarrierStageOnSubmittedSuite.scala:53)
> 2020-07-13T21:39:28.4178272Z [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite.$anonfun$new$35(BarrierStageOnSubmittedSuite.scala:240)
> 2020-07-13T21:39:28.4178695Z [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 2020-07-13T21:39:28.4179081Z [info]   at 
> org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 2020-07-13T21:39:28.4179731Z [info]   at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 2020-07-13T21:39:28.4180162Z [info]   at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 2020-07-13T21:39:28.4180550Z [info]   at 
> org.scalatest.Transformer.apply(Transformer.scala:22)
> 2020-07-13T21:39:28.4180929Z [info]   at 
> org.scalatest.Transformer.apply(Transformer.scala:20)
> 2020-07-13T21:39:28.4181323Z [info]   at 
> org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 2020-07-13T21:39:28.4181728Z [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> 2020-07-13T21:39:28.4223205Z [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> 2020-07-13T21:39:28.4223689Z [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224119Z [info]   at 
> org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> 2020-07-13T21:39:28.4224510Z [info]   at 
> org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> 2020-07-13T21:39:28.4224901Z [info]   at 
> org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> 2020-07-13T21:39:28.4225362Z [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4225778Z [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> 2020-07-13T21:39:28.4226188Z [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> 2020-07-13T21:39:28.4226589Z [info]   at 
> org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> 2020-07-13T21:39:28.4226997Z [info]   at 
> org.scalatest.FunSu

[jira] [Created] (SPARK-32296) Flaky Test: submit a barrier ResultStage that requires more slots than current total under local-cluster mode

2020-07-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32296:


 Summary: Flaky Test: submit a barrier ResultStage that requires 
more slots than current total under local-cluster mode
 Key: SPARK-32296
 URL: https://issues.apache.org/jira/browse/SPARK-32296
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


{code}
2020-07-13T21:39:28.3795362Z [info] - submit a barrier 
ResultStage that requires more slots than current total under local-cluster 
mode *** FAILED *** (5 seconds, 703 milliseconds)
2020-07-13T21:39:28.3843780Z [info]   Expected exception 
org.apache.spark.SparkException to be thrown, but 
java.util.concurrent.TimeoutException was thrown 
(BarrierStageOnSubmittedSuite.scala:53)
2020-07-13T21:39:28.3844344Z [info]   
org.scalatest.exceptions.TestFailedException:
2020-07-13T21:39:28.4058689Z [info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
2020-07-13T21:39:28.4059209Z [info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
2020-07-13T21:39:28.4175876Z [info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
2020-07-13T21:39:28.4176563Z [info]   at 
org.scalatest.Assertions.intercept(Assertions.scala:814)
2020-07-13T21:39:28.4176967Z [info]   at 
org.scalatest.Assertions.intercept$(Assertions.scala:804)
2020-07-13T21:39:28.4177353Z [info]   at 
org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
2020-07-13T21:39:28.4177794Z [info]   at 
org.apache.spark.BarrierStageOnSubmittedSuite.testSubmitJob(BarrierStageOnSubmittedSuite.scala:53)
2020-07-13T21:39:28.4178272Z [info]   at 
org.apache.spark.BarrierStageOnSubmittedSuite.$anonfun$new$35(BarrierStageOnSubmittedSuite.scala:240)
2020-07-13T21:39:28.4178695Z [info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2020-07-13T21:39:28.4179081Z [info]   at 
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2020-07-13T21:39:28.4179731Z [info]   at 
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2020-07-13T21:39:28.4180162Z [info]   at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2020-07-13T21:39:28.4180550Z [info]   at 
org.scalatest.Transformer.apply(Transformer.scala:22)
2020-07-13T21:39:28.4180929Z [info]   at 
org.scalatest.Transformer.apply(Transformer.scala:20)
2020-07-13T21:39:28.4181323Z [info]   at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
2020-07-13T21:39:28.4181728Z [info]   at 
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
2020-07-13T21:39:28.4223205Z [info]   at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
2020-07-13T21:39:28.4223689Z [info]   at 
org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
2020-07-13T21:39:28.4224119Z [info]   at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
2020-07-13T21:39:28.4224510Z [info]   at 
org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
2020-07-13T21:39:28.4224901Z [info]   at 
org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
2020-07-13T21:39:28.4225362Z [info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
2020-07-13T21:39:28.4225778Z [info]   at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
2020-07-13T21:39:28.4226188Z [info]   at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
2020-07-13T21:39:28.4226589Z [info]   at 
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
2020-07-13T21:39:28.4226997Z [info]   at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
2020-07-13T21:39:28.4227685Z [info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
2020-07-13T21:39:28.4228069Z [info]   at 
scala.collection.immutable.List.foreach(List.scala:392)
2020-07-13T21:39:28.4228461Z [info]   at 
org.scalatest.SuperEngine.trav

[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32292:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Run only relevant builds in parallel at Github Actions
> --
>
> Key: SPARK-32292
> URL: https://issues.apache.org/jira/browse/SPARK-32292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Jenkins already runs only the relevant tests. Github Actions should reuse the 
> same logic and follow suit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32004) Drop references to slave

2020-07-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-32004.
--
Fix Version/s: 3.1.0
 Assignee: Holden Karau
   Resolution: Fixed

> Drop references to slave
> 
>
> Key: SPARK-32004
> URL: https://issues.apache.org/jira/browse/SPARK-32004
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> We have a lot of references to "slave" in the code base that don't match the 
> terminology in the rest of our code base, and we should clean them up. In many 
> situations "executor", "worker", or "replica" would be clearer, depending on 
> the context (so this is not just a search and replace; we need to actually read 
> through the code and make it consistent).
>  
> We may want to explore (in a follow-up) renaming "master" to something more 
> precise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156976#comment-17156976
 ] 

Apache Spark commented on SPARK-32295:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29092

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: performance
>
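
For illustration, a minimal sketch of the rewrite this issue proposes, assuming 
a spark-shell session and a hypothetical array column named "items" (an inner 
explode already drops null and empty arrays, so the explicit filters are 
equivalent and can be pushed down to the source):

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, size}

// Hypothetical dataset with an array column.
val df = Seq(Seq(1, 2), Seq.empty[Int]).toDF("items")

// Adding the filters an inner explode implies anyway lets them participate
// in predicate pushdown.
val exploded = df
  .filter(col("items").isNotNull && size(col("items")) > 0)
  .select(explode(col("items")).as("item"))

exploded.explain(true)
{code}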




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156975#comment-17156975
 ] 

Apache Spark commented on SPARK-32295:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29092

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: performance
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32295:


Assignee: Apache Spark

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>  Labels: performance
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32295:


Assignee: (was: Apache Spark)

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: performance
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-07-13 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-32295:
--

 Summary: Add not null and size > 0 filters before inner explode to 
benefit from predicate pushdown
 Key: SPARK-32295
 URL: https://issues.apache.org/jira/browse/SPARK-32295
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 3.1.0
Reporter: Tanel Kiis






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-07-13 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32234:

Target Version/s: 3.0.1

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Priority: Blocker
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
> The reason behind this is that initBatch does not get the schema that is 
> needed to find the column values in OrcFileFormat.scala:
>  
> {code:java}
> batchReader.initBatch(
>  TypeDescription.fromString(resultSchemaString){code}
>  
> The query works if:
> {code:java}
> val u = """select * from date_dim limit 5"""{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-07-13 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32234:

Priority: Blocker  (was: Major)

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Priority: Blocker
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
> The reason behind this is that initBatch does not get the schema that is 
> needed to find the column values in OrcFileFormat.scala:
>  
> {code:java}
> batchReader.initBatch(
>  TypeDescription.fromString(resultSchemaString){code}
>  
> The query works if:
> {code:java}
> val u = """select * from date_dim limit 5"""{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32258) NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child expressions

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156939#comment-17156939
 ] 

Apache Spark commented on SPARK-32258:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29091

> NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child 
> expressions
> ---
>
> Key: SPARK-32258
> URL: https://issues.apache.org/jira/browse/SPARK-32258
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently the NormalizeFloatingNumbers rule treats some expressions as a black 
> box, but we can optimize it a bit by directly normalizing their inner child 
> expressions.
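
As a hedged illustration (hypothetical column names, spark-shell session 
assumed), a join key built from a Coalesce over float columns is the kind of 
expression whose children can be normalized directly, since -0.0/0.0 and NaN 
must be normalized before the keys are compared:

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{coalesce, lit}

val left  = Seq(0.0f, -0.0f, Float.NaN).toDF("f")
val right = Seq(0.0f, Float.NaN).toDF("f")

// The join key wraps a float column in Coalesce; instead of treating the whole
// Coalesce as a black box, the rule can normalize its child expressions.
val joined = left.join(right,
  coalesce(left("f"), lit(0.0f)) === coalesce(right("f"), lit(0.0f)))

joined.explain(true)
{code}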



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32258) NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child expressions

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156938#comment-17156938
 ] 

Apache Spark commented on SPARK-32258:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29091

> NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child 
> expressions
> ---
>
> Key: SPARK-32258
> URL: https://issues.apache.org/jira/browse/SPARK-32258
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently the NormalizeFloatingNumbers rule treats some expressions as a black 
> box, but we can optimize it a bit by directly normalizing their inner child 
> expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32293:


Assignee: (was: Apache Spark)

> Inconsistent default unit between Spark memory configs and JVM option
> -
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> The memory can be configured separately for the executors and for the driver. 
> All of these options follow the format of the JVM memory configuration in that 
> they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is 
> an inconsistency regarding the default unit. When no suffix is given, the 
> amount is passed as-is to the JVM (to the -Xmx and -Xms options), where these 
> memory options use bytes as the default unit; see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> The Spark memory configs, however, use "m" as the default suffix unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32293:


Assignee: Apache Spark

> Inconsistent default unit between Spark memory configs and JVM option
> -
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> The memory can be configured separately for the executors and for the driver. 
> All of these options follow the format of the JVM memory configuration in that 
> they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is 
> an inconsistency regarding the default unit. When no suffix is given, the 
> amount is passed as-is to the JVM (to the -Xmx and -Xms options), where these 
> memory options use bytes as the default unit; see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> The Spark memory configs, however, use "m" as the default suffix unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156917#comment-17156917
 ] 

Apache Spark commented on SPARK-32293:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/29090

> Inconsistent default unit between Spark memory configs and JVM option
> -
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> The memory can be configured separately for the executors and for the driver. 
> All of these options follow the format of the JVM memory configuration in that 
> they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is 
> an inconsistency regarding the default unit. When no suffix is given, the 
> amount is passed as-is to the JVM (to the -Xmx and -Xms options), where these 
> memory options use bytes as the default unit; see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> The Spark memory configs, however, use "m" as the default suffix unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2020-07-13 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-32294:
--
Description: 
`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
GroupedData, the whole group is passed to Pandas UDF at once, which can cause 
various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
2Gb limitation on Netty allocator side) - 
https://issues.apache.org/jira/browse/ARROW-4890 

Would be great to consider feeding GroupedData into a pandas UDF in batches to 
solve this issue. 

cc [~hyukjin.kwon] 

 

  was:
`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
GroupedData, the whole group is passed to Pandas UDF as once, which can cause 
various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
2Gb limitation on Netty allocator side) - 
https://issues.apache.org/jira/browse/ARROW-4890 

Would be great to consider feeding GroupedData into a pandas UDF in batches to 
solve this issue. 

cc [~hyukjin.kwon] 

 


> GroupedData Pandas UDF 2Gb limit
> 
>
> Key: SPARK-32294
> URL: https://issues.apache.org/jira/browse/SPARK-32294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
> GroupedData, the whole group is passed to Pandas UDF at once, which can cause 
> various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
> 2Gb limitation on Netty allocator side) - 
> https://issues.apache.org/jira/browse/ARROW-4890 
> Would be great to consider feeding GroupedData into a pandas UDF in batches 
> to solve this issue. 
> cc [~hyukjin.kwon] 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2020-07-13 Thread Ruslan Dautkhanov (Jira)
Ruslan Dautkhanov created SPARK-32294:
-

 Summary: GroupedData Pandas UDF 2Gb limit
 Key: SPARK-32294
 URL: https://issues.apache.org/jira/browse/SPARK-32294
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0, 3.1.0
Reporter: Ruslan Dautkhanov


`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
GroupedData, the whole group is passed to Pandas UDF as once, which can cause 
various 2Gb limitations on Arrow side (and in current versions of Arrow, also 
2Gb limitation on Netty allocator side) - 
https://issues.apache.org/jira/browse/ARROW-4890 

Would be great to consider feeding GroupedData into a pandas UDF in batches to 
solve this issue. 

cc [~hyukjin.kwon] 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30282) Migrate SHOW TBLPROPERTIES to new framework

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156900#comment-17156900
 ] 

Apache Spark commented on SPARK-30282:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/28375

> Migrate SHOW TBLPROPERTIES to new framework
> ---
>
> Key: SPARK-30282
> URL: https://issues.apache.org/jira/browse/SPARK-30282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> For the following v2 commands, _Analyzer.ResolveTables_ does not check 
> against the temp views before resolving _UnresolvedV2Relation_, thus it 
> always resolves _UnresolvedV2Relation_ to a table:
>  * ALTER TABLE
>  * DESCRIBE TABLE
>  * SHOW TBLPROPERTIES
> Thus, in the following example, 't' will be resolved to a table, not a temp 
> view:
> {code:java}
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")
> sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i")
> sql("USE testcat.ns")
> sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table
> {code}
> For v2 commands, if the relation resolves to a temp view, the command should 
> error out with a message saying that v2 commands cannot handle temp views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option

2020-07-13 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-32293:
---
Summary: Inconsistent default unit between Spark memory configs and JVM 
option  (was: Inconsistent default units for configuring Spark memory)

> Inconsistent default unit between Spark memory configs and JVM option
> -
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> The memory can be configured separately for the executors and for the driver. 
> All of these options follow the format of the JVM memory configuration in that 
> they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is 
> an inconsistency regarding the default unit. When no suffix is given, the 
> amount is passed as-is to the JVM (to the -Xmx and -Xms options), where these 
> memory options use bytes as the default unit; see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> The Spark memory configs, however, use "m" as the default suffix unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32293) Inconsistent default units for configuring Spark memory

2020-07-13 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156884#comment-17156884
 ] 

Attila Zsolt Piros commented on SPARK-32293:


I am working on this.

> Inconsistent default units for configuring Spark memory
> ---
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> The memory can be configured separately for the executors and for the driver. 
> All of these options follow the format of the JVM memory configuration in that 
> they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is 
> an inconsistency regarding the default unit. When no suffix is given, the 
> amount is passed as-is to the JVM (to the -Xmx and -Xms options), where these 
> memory options use bytes as the default unit; see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> The Spark memory configs, however, use "m" as the default suffix unit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32293) Inconsistent default units for configuring Spark memory

2020-07-13 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-32293:
--

 Summary: Inconsistent default units for configuring Spark memory
 Key: SPARK-32293
 URL: https://issues.apache.org/jira/browse/SPARK-32293
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 3.0.0, 2.4.6, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 
2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.3, 2.2.2, 2.2.1, 3.0.1, 3.1.0
Reporter: Attila Zsolt Piros


Spark's maximum memory can be configured in several ways:
- via Spark config
- command line argument
- environment variables 

The memory can be configured separately for the executors and for the driver. 
All of these options follow the format of the JVM memory configuration in that 
they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is 
an inconsistency regarding the default unit. When no suffix is given, the amount 
is passed as-is to the JVM (to the -Xmx and -Xms options), where these memory 
options use bytes as the default unit; see the example 
[here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:

{noformat}
The following examples show how to set the maximum allowed size of allocated 
memory to 80 MB using various units:

-Xmx83886080 
-Xmx81920k 
-Xmx80m
{noformat}

The Spark memory configs, however, use "m" as the default suffix unit.
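
Until the defaults are made consistent, a safe practice is to always spell out 
the unit suffix explicitly; a minimal sketch (config values here are only 
examples):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "4g")    // explicit "g" suffix
  .set("spark.executor.memory", "8g")  // a bare "8" is ambiguous: bytes when
                                       // passed straight to -Xmx, but "m" by
                                       // the Spark config default
{code}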



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32279) Install Sphinx in Python 3 on Jenkins machines

2020-07-13 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156846#comment-17156846
 ] 

Shane Knapp commented on SPARK-32279:
-

any particular version of Sphinx you want installed?

> Install Sphinx in Python 3 on Jenkins machines
> --
>
> Key: SPARK-32279
> URL: https://issues.apache.org/jira/browse/SPARK-32279
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Currently Sphinx is only installed for Python 2. We should install it for 
> Python 3 and test it in Jenkins, as Python 2, 3.4 and 3.5 were dropped in 
> SPARK-32138.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-ccd847a0316575dde31bd89786bbe1f2R176
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/dev/lint-python#L176



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32276) Remove redundant sorts before repartition nodes

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156844#comment-17156844
 ] 

Apache Spark commented on SPARK-32276:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/29089

> Remove redundant sorts before repartition nodes
> ---
>
> Key: SPARK-32276
> URL: https://issues.apache.org/jira/browse/SPARK-32276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> I think our {{EliminateSorts}} rule can be extended further to remove sorts 
> before repartition, repartitionByExpression and coalesce nodes. Independently 
> of whether we do a shuffle or not, each repartition operation will change the 
> ordering and distribution of data.
> That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as 
> {{Repartition -> Scan}}.
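
A minimal sketch of the pattern being targeted, assuming a spark-shell session 
(the column name is only an example):

{code:scala}
val df = spark.range(100).toDF("id")

// The repartition changes both ordering and distribution, so the sort
// directly below it is redundant and could be removed by an extended
// EliminateSorts rule.
val plan = df
  .orderBy("id")     // Sort
  .repartition(10)   // Repartition

plan.explain(true)
{code}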



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy

2020-07-13 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156842#comment-17156842
 ] 

Shane Knapp edited comment on SPARK-32278 at 7/13/20, 5:03 PM:
---

which version of pypy3 are we interested in?  we currently have pypy 7.2 
(python 3.6.9) installed on the centos workers, and i'd like to nail down a 
version before i install this on the ubuntu nodes.

 

{{[sknapp@amp-jenkins-worker-05 ~]$ pypy3}}
 {{Python 3.6.9 (5da45ced70e515f94686be0df47c59abd1348ebc, Oct 17 2019, 
22:59:56)}}
 {{[PyPy 7.2.0 with GCC 8.2.0] on linux}}
 {{Type "help", "copyright", "credits" or "license" for more information.}}
 {{}}

 


was (Author: shaneknapp):
which version of pypy3 are we interested in?  we currently have 3.6.9 installed 
on the centos workers, and i'd like to nail down a version before i install 
this on the ubuntu nodes.

 

{{[sknapp@amp-jenkins-worker-05 ~]$ pypy3}}
{{Python 3.6.9 (5da45ced70e515f94686be0df47c59abd1348ebc, Oct 17 2019, 
22:59:56)}}
{{[PyPy 7.2.0 with GCC 8.2.0] on linux}}
{{Type "help", "copyright", "credits" or "license" for more information.}}
{{}}

 

> Install PyPy3 on Jenkins to enable PySpark tests with PyPy
> --
>
> Key: SPARK-32278
> URL: https://issues.apache.org/jira/browse/SPARK-32278
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> The PyPy currently installed in Jenkins is too old: it is only Python 2 
> compatible. Python 2 will be dropped in SPARK-32138, so we should now upgrade 
> to the Python 3 compatible PyPy3.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32276) Remove redundant sorts before repartition nodes

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156843#comment-17156843
 ] 

Apache Spark commented on SPARK-32276:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/29089

> Remove redundant sorts before repartition nodes
> ---
>
> Key: SPARK-32276
> URL: https://issues.apache.org/jira/browse/SPARK-32276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> I think our {{EliminateSorts}} rule can be extended further to remove sorts 
> before repartition, repartitionByExpression and coalesce nodes. Independently 
> of whether we do a shuffle or not, each repartition operation will change the 
> ordering and distribution of data.
> That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as 
> {{Repartition -> Scan}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32276) Remove redundant sorts before repartition nodes

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32276:


Assignee: Apache Spark

> Remove redundant sorts before repartition nodes
> ---
>
> Key: SPARK-32276
> URL: https://issues.apache.org/jira/browse/SPARK-32276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Anton Okolnychyi
>Assignee: Apache Spark
>Priority: Major
>
> I think our {{EliminateSorts}} rule can be extended further to remove sorts 
> before repartition, repartitionByExpression and coalesce nodes. Independently 
> of whether we do a shuffle or not, each repartition operation will change the 
> ordering and distribution of data.
> That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as 
> {{Repartition -> Scan}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32278) Install PyPy3 on Jenkins to enable PySpark tests with PyPy

2020-07-13 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156842#comment-17156842
 ] 

Shane Knapp commented on SPARK-32278:
-

which version of pypy3 are we interested in?  we currently have 3.6.9 installed 
on the centos workers, and i'd like to nail down a version before i install 
this on the ubuntu nodes.

 

{{[sknapp@amp-jenkins-worker-05 ~]$ pypy3}}
{{Python 3.6.9 (5da45ced70e515f94686be0df47c59abd1348ebc, Oct 17 2019, 
22:59:56)}}
{{[PyPy 7.2.0 with GCC 8.2.0] on linux}}
{{Type "help", "copyright", "credits" or "license" for more information.}}
{{}}

 

> Install PyPy3 on Jenkins to enable PySpark tests with PyPy
> --
>
> Key: SPARK-32278
> URL: https://issues.apache.org/jira/browse/SPARK-32278
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> The PyPy currently installed in Jenkins is too old: it is only Python 2 
> compatible. Python 2 will be dropped in SPARK-32138, so we should now upgrade 
> to the Python 3 compatible PyPy3.
> See also:
> https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R160
> https://github.com/apache/spark/blob/ec42492b60559a983435a24630d5dc8827cf22d9/python/run-tests.py#L160



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32276) Remove redundant sorts before repartition nodes

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32276:


Assignee: (was: Apache Spark)

> Remove redundant sorts before repartition nodes
> ---
>
> Key: SPARK-32276
> URL: https://issues.apache.org/jira/browse/SPARK-32276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> I think our {{EliminateSorts}} rule can be extended further to remove sorts 
> before repartition, repartitionByExpression and coalesce nodes. Independently 
> of whether we do a shuffle or not, each repartition operation will change the 
> ordering and distribution of data.
> That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as 
> {{Repartition -> Scan}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32252) Enable doctests in run-tests.py back

2020-07-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32252:
-

Assignee: Hyukjin Kwon

> Enable doctests in run-tests.py back
> 
>
> Key: SPARK-32252
> URL: https://issues.apache.org/jira/browse/SPARK-32252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> In the run-tests.py script, we're skipping these tests when 
> {{TEST_ONLY_MODULES}} is set. This is mainly because the doctests fail in 
> Github Actions. We should test them. Currently it fails as below:
> {code}
> fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git  [...] -- [...]'
> **
> File "./dev/run-tests.py", line 75, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/doctest.py", line 1330, in __run
> compileflags, 1), test.globs)
>   File "", 
> line 1, in 
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 87, in 
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
> **kwargs).stdout
>   File "/usr/lib/python3.6/subprocess.py", line 438, in run
> output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['git', 'diff-tree', 
> '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit 
> status 128.
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git  [...] -- [...]'
> {code}
> Looks like we should fetch the commit to test in GitHub Actions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32252) Enable doctests in run-tests.py back

2020-07-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32252.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29086
[https://github.com/apache/spark/pull/29086]

> Enable doctests in run-tests.py back
> 
>
> Key: SPARK-32252
> URL: https://issues.apache.org/jira/browse/SPARK-32252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> In the run-tests.py script, we're skipping these tests when 
> {{TEST_ONLY_MODULES}} is set. This is mainly because the doctests fail in 
> Github Actions. We should test them. Currently it fails as below:
> {code}
> fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git  [...] -- [...]'
> **
> File "./dev/run-tests.py", line 75, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/doctest.py", line 1330, in __run
> compileflags, 1), test.globs)
>   File "", 
> line 1, in 
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 87, in 
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
> **kwargs).stdout
>   File "/usr/lib/python3.6/subprocess.py", line 438, in run
> output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['git', 'diff-tree', 
> '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit 
> status 128.
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git  [...] -- [...]'
> {code}
> Looks like we should fetch the commit to test in GitHub Actions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32292.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29086
[https://github.com/apache/spark/pull/29086]

> Run only relevant builds in parallel at Github Actions
> --
>
> Key: SPARK-32292
> URL: https://issues.apache.org/jira/browse/SPARK-32292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> Jenkins already runs only the relevant tests. GitHub Actions should reuse that 
> logic and do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32289:


Assignee: (was: Apache Spark)

> Chinese characters are garbled when opening csv files with Excel
> 
>
> Key: SPARK-32289
> URL: https://issues.apache.org/jira/browse/SPARK-32289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: garbled.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("SELECT '我爱中文' AS chinese").write.option("header", 
> "true").csv("/tmp/spark/csv")
> {code}
>  !garbled.png! 
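
For context, Excel typically assumes a locale-specific encoding (or needs a UTF-8 BOM) when opening a CSV directly, while Spark writes plain UTF-8. A possible workaround, shown here only as a sketch and not necessarily the approach taken in the linked PR, is to write the file in an encoding the local Excel installation expects:

{code:scala}
// Workaround sketch: write the CSV in GBK so a Chinese-locale Excel opens it
// correctly. The output path and the choice of GBK are illustrative assumptions.
spark.sql("SELECT '我爱中文' AS chinese")
  .write
  .option("header", "true")
  .option("encoding", "GBK")
  .csv("/tmp/spark/csv_gbk")
{code}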



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156780#comment-17156780
 ] 

Apache Spark commented on SPARK-32289:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29088

> Chinese characters are garbled when opening csv files with Excel
> 
>
> Key: SPARK-32289
> URL: https://issues.apache.org/jira/browse/SPARK-32289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: garbled.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("SELECT '我爱中文' AS chinese").write.option("header", 
> "true").csv("/tmp/spark/csv")
> {code}
>  !garbled.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32289:


Assignee: Apache Spark

> Chinese characters are garbled when opening csv files with Excel
> 
>
> Key: SPARK-32289
> URL: https://issues.apache.org/jira/browse/SPARK-32289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: garbled.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("SELECT '我爱中文' AS chinese").write.option("header", 
> "true").csv("/tmp/spark/csv")
> {code}
>  !garbled.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28227) Spark can’t support TRANSFORM with aggregation

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156698#comment-17156698
 ] 

Apache Spark commented on SPARK-28227:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29087

> Spark can’t  support TRANSFORM with aggregation
> ---
>
> Key: SPARK-28227
> URL: https://issues.apache.org/jira/browse/SPARK-28227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Spark can't support using TRANSFORM with aggregation, such as:
> {code:java}
> SELECT TRANSFORM(T.A, SUM(T.B))
> USING 'func' AS (X STRING Y STRING)
> FROM DEFAULT.TEST T
> WHERE T.C > 0
> GROUP BY T.A{code}
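
Until that is supported natively, a common workaround (a sketch only, reusing the table and script names from the example above and assuming Hive support is enabled for the script transform) is to aggregate in a subquery and run TRANSFORM over its result:

{code:scala}
// Workaround sketch: aggregate first, then feed the aggregated rows to TRANSFORM.
spark.sql("""
  SELECT TRANSFORM(S.A, S.SUM_B)
  USING 'func' AS (X STRING, Y STRING)
  FROM (
    SELECT T.A, SUM(T.B) AS SUM_B
    FROM DEFAULT.TEST T
    WHERE T.C > 0
    GROUP BY T.A
  ) S
""").show()
{code}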



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28227) Spark can’t support TRANSFORM with aggregation

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156699#comment-17156699
 ] 

Apache Spark commented on SPARK-28227:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29087

> Spark can’t  support TRANSFORM with aggregation
> ---
>
> Key: SPARK-28227
> URL: https://issues.apache.org/jira/browse/SPARK-28227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Spark can't support using TRANSFORM with aggregation, such as:
> {code:java}
> SELECT TRANSFORM(T.A, SUM(T.B))
> USING 'func' AS (X STRING Y STRING)
> FROM DEFAULT.TEST T
> WHERE T.C > 0
> GROUP BY T.A{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32252) Enable doctests in run-tests.py back

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156641#comment-17156641
 ] 

Apache Spark commented on SPARK-32252:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29086

> Enable doctests in run-tests.py back
> 
>
> Key: SPARK-32252
> URL: https://issues.apache.org/jira/browse/SPARK-32252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In the run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} 
> is set, mainly because the doctests fail in GitHub Actions. We should run them 
> as well. Currently they fail as below:
> {code}
> fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> **
> File "./dev/run-tests.py", line 75, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/doctest.py", line 1330, in __run
> compileflags, 1), test.globs)
>   File "", 
> line 1, in 
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 87, in 
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
> **kwargs).stdout
>   File "/usr/lib/python3.6/subprocess.py", line 438, in run
> output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['git', 'diff-tree', 
> '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit 
> status 128.
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> {code}
> It looks like we should fetch the commit to test in GitHub Actions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32252) Enable doctests in run-tests.py back

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156639#comment-17156639
 ] 

Apache Spark commented on SPARK-32252:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29086

> Enable doctests in run-tests.py back
> 
>
> Key: SPARK-32252
> URL: https://issues.apache.org/jira/browse/SPARK-32252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In the run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} 
> is set, mainly because the doctests fail in GitHub Actions. We should run them 
> as well. Currently they fail as below:
> {code}
> fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> **
> File "./dev/run-tests.py", line 75, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/doctest.py", line 1330, in __run
> compileflags, 1), test.globs)
>   File "", 
> line 1, in 
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 87, in 
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
> **kwargs).stdout
>   File "/usr/lib/python3.6/subprocess.py", line 438, in run
> output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['git', 'diff-tree', 
> '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit 
> status 128.
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> {code}
> It looks like we should fetch the commit to test in GitHub Actions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32292:


Assignee: Apache Spark

> Run only relevant builds in parallel at Github Actions
> --
>
> Key: SPARK-32292
> URL: https://issues.apache.org/jira/browse/SPARK-32292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Jenkins already runs only the relevant tests. GitHub Actions should reuse that 
> logic and do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32252) Enable doctests in run-tests.py back

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32252:


Assignee: Apache Spark

> Enable doctests in run-tests.py back
> 
>
> Key: SPARK-32252
> URL: https://issues.apache.org/jira/browse/SPARK-32252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> In the run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} 
> is set, mainly because the doctests fail in GitHub Actions. We should run them 
> as well. Currently they fail as below:
> {code}
> fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> **
> File "./dev/run-tests.py", line 75, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/doctest.py", line 1330, in __run
> compileflags, 1), test.globs)
>   File "", 
> line 1, in 
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 87, in 
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
> **kwargs).stdout
>   File "/usr/lib/python3.6/subprocess.py", line 438, in run
> output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['git', 'diff-tree', 
> '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit 
> status 128.
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> {code}
> It looks like we should fetch the commit to test in GitHub Actions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32292:


Assignee: Apache Spark

> Run only relevant builds in parallel at Github Actions
> --
>
> Key: SPARK-32292
> URL: https://issues.apache.org/jira/browse/SPARK-32292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Jenkins already runs only the relevant tests. GitHub Actions should reuse that 
> logic and do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32252) Enable doctests in run-tests.py back

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32252:


Assignee: (was: Apache Spark)

> Enable doctests in run-tests.py back
> 
>
> Key: SPARK-32252
> URL: https://issues.apache.org/jira/browse/SPARK-32252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In the run-tests.py script, we're skipping the tests when {{TEST_ONLY_MODULES}} 
> is set, mainly because the doctests fail in GitHub Actions. We should run them 
> as well. Currently they fail as below:
> {code}
> fatal: ambiguous argument 'fc0a1475ef': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> **
> File "./dev/run-tests.py", line 75, in 
> __main__.identify_changed_files_from_git_commits
> Failed example:
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/doctest.py", line 1330, in __run
> compileflags, 1), test.globs)
>   File "", 
> line 1, in 
> [x.name for x in determine_modules_for_files( 
> identify_changed_files_from_git_commits("fc0a1475ef", target_ref="5da21f07"))]
>   File "./dev/run-tests.py", line 87, in 
> identify_changed_files_from_git_commits
> universal_newlines=True)
>   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
> **kwargs).stdout
>   File "/usr/lib/python3.6/subprocess.py", line 438, in run
> output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['git', 'diff-tree', 
> '--no-commit-id', '--name-only', '-r', 'fc0a1475ef']' returned non-zero exit 
> status 128.
> fatal: ambiguous argument '50a0496a43': unknown revision or path not in the 
> working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
> {code}
> It looks like we should fetch the commit to test in GitHub Actions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156636#comment-17156636
 ] 

Apache Spark commented on SPARK-32292:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29086

> Run only relevant builds in parallel at Github Actions
> --
>
> Key: SPARK-32292
> URL: https://issues.apache.org/jira/browse/SPARK-32292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Jenkins already runs only the relevant tests. GitHub Actions should reuse that 
> logic and do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32292:


Assignee: (was: Apache Spark)

> Run only relevant builds in parallel at Github Actions
> --
>
> Key: SPARK-32292
> URL: https://issues.apache.org/jira/browse/SPARK-32292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Jenkins already runs only the relevant tests. GitHub Actions should reuse that 
> logic and do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32106) Implement script transform in sql/core

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32106:


Assignee: Apache Spark

> Implement script transform in sql/core
> --
>
> Key: SPARK-32106
> URL: https://issues.apache.org/jira/browse/SPARK-32106
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32106) Implement script transform in sql/core

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156629#comment-17156629
 ] 

Apache Spark commented on SPARK-32106:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29085

> Implement script transform in sql/core
> --
>
> Key: SPARK-32106
> URL: https://issues.apache.org/jira/browse/SPARK-32106
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32106) Implement script transform in sql/core

2020-07-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32106:


Assignee: (was: Apache Spark)

> Implement script transform in sql/core
> --
>
> Key: SPARK-32106
> URL: https://issues.apache.org/jira/browse/SPARK-32106
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s

2020-07-13 Thread Rob Vesse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156627#comment-17156627
 ] 

Rob Vesse commented on SPARK-32259:
---

bq. We use Spark launcher to do spark submit in k8s. Since it is evicted, the 
pod logs for stack trace is not available. we have only pod events given in 
attachment

You should still be able to use {{kubectl logs}} to retrieve the logs of 
terminated pods, unless these are executor pods that are being evicted, since I 
believe Spark cleans those up automatically. You can add 
{{spark.kubernetes.executor.deleteOnTermination=false}} to your configuration 
to disable this behaviour so that you can retrieve those logs later.
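
Since the job is submitted through the Spark launcher, the setting can also be passed programmatically. A minimal sketch, where the master URL, application class and jar path are placeholders:

{code:scala}
import org.apache.spark.launcher.SparkLauncher

// Keep terminated executor pods around so their logs stay retrievable via
// `kubectl logs`. All application-specific values below are placeholders.
val handle = new SparkLauncher()
  .setMaster("k8s://https://kubernetes.default.svc")
  .setDeployMode("cluster")
  .setMainClass("com.example.MyApp")
  .setAppResource("local:///opt/spark/jars/my-app.jar")
  .setConf("spark.kubernetes.local.dirs.tmpfs", "true")
  .setConf("spark.kubernetes.executor.deleteOnTermination", "false")
  .startApplication()
{code}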

> tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
> ---
>
> Key: SPARK-32259
> URL: https://issues.apache.org/jira/browse/SPARK-32259
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prakash Rajendran
>Priority: Blocker
> Attachments: Capture.PNG
>
>
> In Spark-Submit, I have these config 
> "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still spark 
> is not pointing its spill data to SPARK_LOCAL_DIRS path.
> K8s is evicting the pod due to error "{color:#de350b}*Pod ephemeral local 
> storage usage exceeds the total limit of containers.*{color}"
>  
> We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod 
> logs for stack trace is not available. we have only pod events given in 
> attachment
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s

2020-07-13 Thread Rob Vesse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156626#comment-17156626
 ] 

Rob Vesse edited comment on SPARK-32259 at 7/13/20, 10:32 AM:
--

[~prakki79] Ideally you'd also include the following in your report:

* The full {{spark-submit}} command
* The {{spark-defaults.conf}} or whatever configuration file you are using (if 
any)
* The {{kubectl describe pod}} output for the relevant pod(s)
* The {{kubectl get pod -o=yaml}} output for the relevant pod(s)

bq. I have these config "spark.kubernetes.local.dirs.tmpfs=true", still spark 
is not pointing its spill data to SPARK_LOCAL_DIRS path.

Nothing you have shown so far suggests that this is true. All that 
configuration setting does is change how Spark configures the relevant 
{{emptyDir}} volume used for ephemeral storage (and that's assuming you haven't 
supplied other configuration that explicitly configures local directories).

You can exhaust an in-memory volume in exactly the same way as you exhaust a 
disk-based volume and get your pod evicted. Note that when using in-memory 
volumes you may need to adjust the amount of memory allocated to your pod per 
the documentation - 
http://spark.apache.org/docs/latest/running-on-kubernetes.html#using-ram-for-local-storage




was (Author: rvesse):
[~prakki79] Ideally you'd also include the following in your report:

* The full {{spark-submit}} command
* The {{kubectl describe pod}} output for the relevant pod(s)
* The {{kubectl get pod -o=yaml}} output for the relevant pod(s)

bq. I have these config "spark.kubernetes.local.dirs.tmpfs=true", still spark 
is not pointing its spill data to SPARK_LOCAL_DIRS path.

Nothing you have shown so far suggests that this is true. All that 
configuration setting does is change how Spark configures the relevant 
{{emptyDir}} volume used for ephemeral storage (and that's assuming you haven't 
supplied other configuration that explicitly configures local directories).

You can exhaust an in-memory volume in exactly the same way as you exhaust a 
disk-based volume and get your pod evicted. Note that when using in-memory 
volumes you may need to adjust the amount of memory allocated to your pod per 
the documentation - 
http://spark.apache.org/docs/latest/running-on-kubernetes.html#using-ram-for-local-storage



> tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
> ---
>
> Key: SPARK-32259
> URL: https://issues.apache.org/jira/browse/SPARK-32259
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prakash Rajendran
>Priority: Blocker
> Attachments: Capture.PNG
>
>
> In Spark-Submit, I have these config 
> "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still spark 
> is not pointing its spill data to SPARK_LOCAL_DIRS path.
> K8s is evicting the pod due to error "{color:#de350b}*Pod ephemeral local 
> storage usage exceeds the total limit of containers.*{color}"
>  
> We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod 
> logs for stack trace is not available. we have only pod events given in 
> attachment
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32259) tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s

2020-07-13 Thread Rob Vesse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156626#comment-17156626
 ] 

Rob Vesse commented on SPARK-32259:
---

[~prakki79] Ideally you'd also include the following in your report:

* The full {{spark-submit}} command
* The {{kubectl describe pod}} output for the relevant pod(s)
* The {{kubectl get pod -o=yaml}} output for the relevant pod(s)

bq. I have these config "spark.kubernetes.local.dirs.tmpfs=true", still spark 
is not pointing its spill data to SPARK_LOCAL_DIRS path.

Nothing you have shown so far suggests that this is true. All that 
configuration setting does is change how Spark configures the relevant 
{{emptyDir}} volume used for ephemeral storage (and that's assuming you haven't 
supplied other configuration that explicitly configures local directories).

You can exhaust an in-memory volume in exactly the same way as you exhaust a 
disk-based volume and get your pod evicted. Note that when using in-memory 
volumes you may need to adjust the amount of memory allocated to your pod per 
the documentation - 
http://spark.apache.org/docs/latest/running-on-kubernetes.html#using-ram-for-local-storage



> tmpfs=true, not pointing to SPARK_LOCAL_DIRS in k8s
> ---
>
> Key: SPARK-32259
> URL: https://issues.apache.org/jira/browse/SPARK-32259
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prakash Rajendran
>Priority: Blocker
> Attachments: Capture.PNG
>
>
> In Spark-Submit, I have these config 
> "{color:#4c9aff}*spark.kubernetes.local.dirs.tmpfs=true*{color}", still spark 
> is not pointing its spill data to SPARK_LOCAL_DIRS path.
> K8s is evicting the pod due to error "{color:#de350b}*Pod ephemeral local 
> storage usage exceeds the total limit of containers.*{color}"
>  
> We use Spark launcher to do spark submit in k8s. Since it is evicted, the pod 
> logs for stack trace is not available. we have only pod events given in 
> attachment
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32292) Run only relevant builds in parallel at Github Actions

2020-07-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32292:


 Summary: Run only relevant builds in parallel at Github Actions
 Key: SPARK-32292
 URL: https://issues.apache.org/jira/browse/SPARK-32292
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Jenkins already runs only the relevant tests. GitHub Actions should reuse that 
logic and do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32253) Make readability better in the test result logs

2020-07-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156609#comment-17156609
 ] 

Hyukjin Kwon commented on SPARK-32253:
--

See also https://github.com/check-run-reporter/action

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, the readability of the logs is not really good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32105) Refactor current script transform code

2020-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32105.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27983
[https://github.com/apache/spark/pull/27983]

> Refactor current script transform code
> --
>
> Key: SPARK-32105
> URL: https://issues.apache.org/jira/browse/SPARK-32105
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32105) Refactor current script transform code

2020-07-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32105:
---

Assignee: angerszhu

> Refactor current script transform code
> --
>
> Key: SPARK-32105
> URL: https://issues.apache.org/jira/browse/SPARK-32105
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-07-13 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-30985:

Description: 
SPARK_CONF_DIR hosts configuration files like, 
 1) spark-defaults.conf - containing all the spark properties.
 2) log4j.properties - Logger configuration.
 3) spark-env.sh - Environment variables to be set up at the driver and executor.
 4) core-site.xml - Hadoop related configuration.
 5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
 6) metrics.properties - Spark metrics.
 7) Any user specific - library or framework specific configuration file.

Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
configuration files.

So this feature will let the user-specific configuration files be mounted on 
the driver and executor pods' SPARK_CONF_DIR.

Please review the attached design doc, for more details.

 

[Google docs link|https://bit.ly/spark-30985]

 

  was:
SPARK_CONF_DIR hosts configuration files like, 
 1) spark-defaults.conf - containing all the spark properties.
 2) log4j.properties - Logger configuration.
 3) spark-env.sh - Environment variables to be set up at the driver and executor.
 4) core-site.xml - Hadoop related configuration.
 5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
 6) metrics.properties - Spark metrics.
 7) Any user specific - library or framework specific configuration file.

Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
configuration files.

So this feature will let the user-specific configuration files be mounted on 
the driver and executor pods' SPARK_CONF_DIR.

Please review the attached design doc, for more details.

 

[https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing]

 


> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be set up at the driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
> configuration files.
> So this feature will let the user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc, for more details.
>  
> [Google docs link|https://bit.ly/spark-30985]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel

2020-07-13 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156548#comment-17156548
 ] 

angerszhu commented on SPARK-32289:
---

[https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058]

> Chinese characters are garbled when opening csv files with Excel
> 
>
> Key: SPARK-32289
> URL: https://issues.apache.org/jira/browse/SPARK-32289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: garbled.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("SELECT '我爱中文' AS chinese").write.option("header", 
> "true").csv("/tmp/spark/csv")
> {code}
>  !garbled.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32291:

Attachment: coalesce.png

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png, coalesce.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32291:

Description: 
How to reproduce this issue:
{code:scala}
spark.range(100).createTempView("t1")
spark.range(200).createTempView("t2")
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
t2.id)").show
{code}

The dag is:
 !COALESCE.png! 

A real case:
 !coalesce.png! 
 !repartition.png! 


  was:
How to reproduce this issue:
{code:scala}
spark.range(100).createTempView("t1")
spark.range(200).createTempView("t2")
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
t2.id)").show
{code}

The dag is:




> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png, coalesce.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:
>  !COALESCE.png! 
> A real case:
>  !coalesce.png! 
>  !repartition.png! 
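
For comparison, and only as a sketch based on the repro above rather than a statement about the eventual fix, the {{REPARTITION}} hint adds a shuffle after the join instead of coalescing the join's input, so the join itself keeps its parallelism:

{code:scala}
spark.range(100).createTempView("t1")
spark.range(200).createTempView("t2")
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")

// REPARTITION(1) shuffles the join output down to one partition, whereas
// COALESCE(1) makes the join itself run with a single task.
spark.sql("select /*+ REPARTITION(1) */ t1.* from t1 join t2 on (t1.id = t2.id)").show
{code}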



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32291:

Attachment: repartition.png

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png, repartition.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32291:

Attachment: COALESCE.png

> COALESCE should not reduce the child parallelism if it is Join
> --
>
> Key: SPARK-32291
> URL: https://issues.apache.org/jira/browse/SPARK-32291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: COALESCE.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.range(100).createTempView("t1")
> spark.range(200).createTempView("t2")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
> t2.id)").show
> {code}
> The dag is:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32291) COALESCE should not reduce the child parallelism if it is Join

2020-07-13 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-32291:
---

 Summary: COALESCE should not reduce the child parallelism if it is 
Join
 Key: SPARK-32291
 URL: https://issues.apache.org/jira/browse/SPARK-32291
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang
 Attachments: COALESCE.png

How to reproduce this issue:
{code:scala}
spark.range(100).createTempView("t1")
spark.range(200).createTempView("t2")
spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
spark.sql("select /*+ COALESCE(1) */ t1.* from t1 join t2 on (t1.id = 
t2.id)").show
{code}

The dag is:





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32220) Cartesian Product Hint cause data error

2020-07-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156540#comment-17156540
 ] 

Apache Spark commented on SPARK-32220:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29084

> Cartesian Product Hint cause data error
> ---
>
> Key: SPARK-32220
> URL: https://issues.apache.org/jira/browse/SPARK-32220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> spark-sql> select * from test4 order by a asc;
> 1 2
> Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO 
> SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s)
> spark-sql>select * from test5 order by a asc
> 1 2
> 2 2
> Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO 
> SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar
> spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 
> where test4.a = test5.a order by test4.a asc ;
> 1 2 1 2
> 1 2 2 2
> Time taken: 0.351 seconds, Fetched 2 row(s)
> 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 
> 2 row(s){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-07-13 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-30985:

Component/s: (was: Spark Core)

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be set up at the driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home to all user-specific 
> configuration files.
> So this feature will let the user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc, for more details.
>  
> [https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize

2020-07-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32290:

Fix Version/s: 3.0.1

> NotInSubquery SingleColumn Optimize
> ---
>
> Key: SPARK-32290
> URL: https://issues.apache.org/jira/browse/SPARK-32290
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Normally, a NOT IN subquery is planned as a BroadcastNestedLoopJoinExec, which 
> is very time consuming. For example, in a recent TPCH benchmark, Query 16 took 
> almost half of the total execution time of all 22 TPCH queries. So I propose 
> the following optimization.
> Inside BroadcastNestedLoopJoinExec, we can identify a single-column NOT IN 
> subquery with the following pattern.
> {code:java}
> case _@Or(
> _@EqualTo(leftAttr: AttributeReference, rightAttr: 
> AttributeReference),
> _@IsNull(
>   _@EqualTo(_: AttributeReference, _: AttributeReference)
> )
>   )
> {code}
> If the build-side rows are few enough, we can put the build-side data into a 
> hash map, so the M*N calculation can be optimized to roughly M*log(N).
> I've run a benchmark on 1TB TPCH: before the optimization Query 16 took around 
> 18 minutes to finish; after applying it, the query finishes in about 30 seconds.
> This optimization only works for single-column NOT IN subqueries, so I'd like 
> to ask whether the community wants this change. I will open the pull request 
> first; if the community considers it a hack, it's fine to just ignore this 
> request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32290) NotInSubquery SingleColumn Optimize

2020-07-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32290:
---

 Summary: NotInSubquery SingleColumn Optimize
 Key: SPARK-32290
 URL: https://issues.apache.org/jira/browse/SPARK-32290
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin
 Fix For: 3.1.0


Normally, a NOT IN subquery is planned as a BroadcastNestedLoopJoinExec, which is 
very time consuming. For example, in a recent TPCH benchmark, Query 16 took almost 
half of the total execution time of all 22 TPCH queries. So I propose the 
following optimization.

Inside BroadcastNestedLoopJoinExec, we can identify a single-column NOT IN 
subquery with the following pattern.

{code:java}
case _@Or(
_@EqualTo(leftAttr: AttributeReference, rightAttr: 
AttributeReference),
_@IsNull(
  _@EqualTo(_: AttributeReference, _: AttributeReference)
)
  )
{code}

If the build-side rows are few enough, we can put the build-side data into a hash 
map, so the M*N calculation can be optimized to roughly M*log(N).

I've run a benchmark on 1TB TPCH: before the optimization Query 16 took around 
18 minutes to finish; after applying it, the query finishes in about 30 seconds.

This optimization only works for single-column NOT IN subqueries, so I'd like to 
ask whether the community wants this change. I will open the pull request first; 
if the community considers it a hack, it's fine to just ignore this request.
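
To make the idea concrete, here is a small standalone sketch using plain Scala collections (not the actual BroadcastNestedLoopJoinExec code path) of how a single-column NOT IN check could do one hash lookup per streamed row instead of scanning the whole broadcast side, while keeping NOT IN's NULL semantics:

{code:scala}
// Standalone sketch of the proposed idea: for `streamed.key NOT IN (select key
// from build)`, replace the per-row scan of the broadcast side with a hash-set
// lookup while preserving NOT IN's three-valued NULL semantics.
def notInSingleColumn(streamed: Seq[Option[Long]],
                      build: Seq[Option[Long]]): Seq[Option[Long]] = {
  val buildHasNull = build.contains(None)
  val buildKeys: Set[Long] = build.flatten.toSet   // built once, not per row

  streamed.filter {
    case _ if build.isEmpty => true                // NOT IN over an empty set is always true
    case None => false                             // NULL NOT IN (...) is NULL -> row dropped
    case Some(k) if buildKeys.contains(k) => false // definite match -> false
    case Some(_) if buildHasNull => false          // no match but build has NULL -> NULL -> dropped
    case Some(_) => true                           // no match, no NULLs -> true
  }
}

// Tiny usage example with hypothetical data.
val kept = notInSingleColumn(
  streamed = Seq(Some(1L), Some(2L), None, Some(5L)),
  build    = Seq(Some(2L), Some(3L)))
println(kept)   // List(Some(1), Some(5))
{code}

In the real operator the set would be built once from the broadcast relation, so the per-row cost drops from a full scan of the build side to a single lookup.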




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org