[jira] [Commented] (SPARK-32253) Make readability better in the test result logs

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164148#comment-17164148
 ] 

Apache Spark commented on SPARK-32253:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29219

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the readability of the logs is not very good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z=HMACV1=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.






[jira] [Commented] (SPARK-32253) Make readability better in the test result logs

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164150#comment-17164150
 ] 

Apache Spark commented on SPARK-32253:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29219

> Make readability better in the test result logs
> ---
>
> Key: SPARK-32253
> URL: https://issues.apache.org/jira/browse/SPARK-32253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the readability of the logs is not very good. For example, see 
> https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z=HMACV1=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D
> We should have a way to easily see the failed test cases.






[jira] [Commented] (SPARK-32408) Enable crossPaths back to prevent side effects

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164135#comment-17164135
 ] 

Apache Spark commented on SPARK-32408:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29218

> Enable crossPaths back to prevent side effects
> --
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
> tests run properly per project.
> This is a correct change since we're not doing a cross build in SBT.
> Now the intermediate classes are placed without the Scala version directory,
> specifically in the SBT build.
> This seems to cause side effects in anything that depends on that path. See,
> for example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should enable crossPaths back for now.
> This is actually an issue in Jenkins jobs as well. See
> https://github.com/apache/spark/pull/29205 for the analysis.






[jira] [Commented] (SPARK-32408) Enable crossPaths back to prevent side effects

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164134#comment-17164134
 ] 

Apache Spark commented on SPARK-32408:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29218

> Enable crossPaths back to prevent side effects
> --
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
> tests run properly per project.
> This is a correct change since we're not doing a cross build in SBT.
> Now the intermediate classes are placed without the Scala version directory,
> specifically in the SBT build.
> This seems to cause side effects in anything that depends on that path. See,
> for example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should enable crossPaths back for now.
> This is actually an issue in Jenkins jobs as well. See
> https://github.com/apache/spark/pull/29205 for the analysis.






[jira] [Resolved] (SPARK-32308) Move by-name resolution logic of unionByName from API code to analysis phase

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32308.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29107
[https://github.com/apache/spark/pull/29107]

> Move by-name resolution logic of unionByName from API code to analysis phase
> 
>
> Key: SPARK-32308
> URL: https://issues.apache.org/jira/browse/SPARK-32308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently the by-name resolution logic of unionByName lives in the API code. We
> should move that logic to the analysis phase.
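For context, here is a minimal PySpark sketch (an editorial illustration, not code
from the issue or the PR) of what unionByName's by-name resolution does: columns
are matched by name rather than by position, which is the resolution logic being
moved into the analyzer.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([("b", 2)], ["name", "id"])

# union() pairs columns by position; unionByName() lines columns up by name,
# so df2's ("name", "id") ordering is reordered to match df1's ("id", "name").
df1.unionByName(df2).show()
{code}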






[jira] [Updated] (SPARK-32372) "Resolved attribute(s) XXX missing" after dedup conflict references

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32372:

Fix Version/s: 2.4.7

> "Resolved attribute(s) XXX missing" after dudup conflict references
> ---
>
> Key: SPARK-32372
> URL: https://issues.apache.org/jira/browse/SPARK-32372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.4, 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> {code:java}
> // case class Person(id: Int, name: String, age: Int)
> sql("SELECT name, avg(age) as avg_age FROM person GROUP BY name")
>   .createOrReplaceTempView("person_a")
> sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = p2.name")
>   .createOrReplaceTempView("person_b")
> sql("SELECT * FROM person_a UNION SELECT * FROM person_b")
>   .createOrReplaceTempView("person_c")
> sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name = p2.name").show
> {code}
> error:
> {code:java}
> [info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: 
> Resolved attribute(s) avg_age#235 missing from name#233,avg_age#231 in 
> operator !Project [name#233, avg_age#235]. Attribute(s) with the same name 
> appear in the operation: avg_age. Please check if the right attribute(s) are 
> used.;;
> ...{code}






[jira] [Updated] (SPARK-32280) AnalysisException thrown when query contains several JOINs

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32280:

Fix Version/s: 2.4.7

> AnalysisException thrown when query contains several JOINs
> --
>
> Key: SPARK-32280
> URL: https://issues.apache.org/jira/browse/SPARK-32280
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: David Lindelöf
>Assignee: wuyi
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> I've come across a curious {{AnalysisException}} thrown in one of my SQL 
> queries, even though the SQL appears legitimate. I was able to reduce it to 
> this example:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> spark.sql('SELECT 1 AS id').createOrReplaceTempView('A')
> spark.sql('''
>  SELECT id,
>  'foo' AS kind
>  FROM A''').createOrReplaceTempView('B')
> spark.sql('''
>  SELECT l.id
>  FROM B AS l
>  JOIN B AS r
>  ON l.kind = r.kind''').createOrReplaceTempView('C')
> spark.sql('''
>  SELECT 0
>  FROM (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  JOIN (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  USING (id)''')
> {code}
> Running this yields the following error:
> {code}
>  py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
> : org.apache.spark.sql.AnalysisException: Resolved attribute(s) kind#11 
> missing from id#10,kind#2,id#7,kind#5 in operator !Join Inner, (kind#11 = 
> kind#5). Attribute(s) with the same name appear in the operation: kind. 
> Please check if the right attribute(s) are used.;;
> Project [0 AS 0#15]
> +- Project [id#0, kind#2, kind#11]
>+- Join Inner, (id#0 = id#14)
>   :- SubqueryAlias `__auto_generated_subquery_name`
>   :  +- Project [id#0, kind#2]
>   : +- Project [id#0, kind#2]
>   :+- Join Inner, (id#0 = id#9)
>   :   :- SubqueryAlias `b`
>   :   :  +- Project [id#0, foo AS kind#2]
>   :   : +- SubqueryAlias `a`
>   :   :+- Project [1 AS id#0]
>   :   :   +- OneRowRelation
>   :   +- SubqueryAlias `c`
>   :  +- Project [id#9]
>   : +- Join Inner, (kind#2 = kind#5)
>   ::- SubqueryAlias `l`
>   ::  +- SubqueryAlias `b`
>   :: +- Project [id#9, foo AS kind#2]
>   ::+- SubqueryAlias `a`
>   ::   +- Project [1 AS id#9]
>   ::  +- OneRowRelation
>   :+- SubqueryAlias `r`
>   :   +- SubqueryAlias `b`
>   :  +- Project [id#7, foo AS kind#5]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#7]
>   :   +- OneRowRelation
>   +- SubqueryAlias `__auto_generated_subquery_name`
>  +- Project [id#14, kind#11]
> +- Project [id#14, kind#11]
>+- Join Inner, (id#14 = id#10)
>   :- SubqueryAlias `b`
>   :  +- Project [id#14, foo AS kind#11]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#14]
>   :   +- OneRowRelation
>   +- SubqueryAlias `c`
>  +- Project [id#10]
> +- !Join Inner, (kind#11 = kind#5)
>:- SubqueryAlias `l`
>:  +- SubqueryAlias `b`
>: +- Project [id#10, foo AS kind#2]
>:+- SubqueryAlias `a`
>:   +- Project [1 AS id#10]
>:  +- OneRowRelation
>+- SubqueryAlias `r`
>   +- SubqueryAlias `b`
>  +- Project [id#7, foo AS kind#5]
> +- SubqueryAlias `a`
>+- Project [1 AS id#7]
>   +- OneRowRelation
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:369)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
>   at 
> 

[jira] [Updated] (SPARK-32408) Enable crossPaths back to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Description: 
After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
tests run properly per project.
This is a correct change since we're not doing a cross build in SBT.

Now the intermediate classes are placed without the Scala version directory,
specifically in the SBT build.
This seems to cause side effects in anything that depends on that path. See, for
example, {{git grep \-r "target/scala\-"}}.

To minimise the side effects, we should enable crossPaths back for now.

This is actually an issue in Jenkins jobs as well. See 

  was:
After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
tests run properly per project.
This is a correct change since we're not doing a cross build in SBT.

Now the intermediate classes are placed without the Scala version directory,
specifically in the SBT build.
This seems to cause side effects in anything that depends on that path. See, for
example, {{git grep \-r "target/scala\-"}}.

To minimise the side effects, we should disable crossPaths only in the GitHub
Actions build for now.


> Enable crossPaths back to prevent side effects
> --
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
> tests run properly per project.
> This is a correct change since we're not doing a cross build in SBT.
> Now the intermediate classes are placed without the Scala version directory,
> specifically in the SBT build.
> This seems to cause side effects in anything that depends on that path. See,
> for example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should enable crossPaths back for now.
> This is actually an issue in Jenkins jobs as well. See 






[jira] [Updated] (SPARK-32408) Enable crossPaths back to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Description: 
After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
tests run properly per project.
This is a correct change since we're not doing a cross build in SBT.

Now the intermediate classes are placed without the Scala version directory,
specifically in the SBT build.
This seems to cause side effects in anything that depends on that path. See, for
example, {{git grep \-r "target/scala\-"}}.

To minimise the side effects, we should enable crossPaths back for now.

This is actually an issue in Jenkins jobs as well. See
https://github.com/apache/spark/pull/29205 for the analysis.

  was:
After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
tests run properly per project.
This is a correct change since we're not doing a cross build in SBT.

Now the intermediate classes are placed without the Scala version directory,
specifically in the SBT build.
This seems to cause side effects in anything that depends on that path. See, for
example, {{git grep \-r "target/scala\-"}}.

To minimise the side effects, we should enable crossPaths back for now.

This is actually an issue in Jenkins jobs as well. See 


> Enable crossPaths back to prevent side effects
> --
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
> tests run properly per project.
> This is a correct change since we're not doing a cross build in SBT.
> Now the intermediate classes are placed without the Scala version directory,
> specifically in the SBT build.
> This seems to cause side effects in anything that depends on that path. See,
> for example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should enable crossPaths back for now.
> This is actually an issue in Jenkins jobs as well. See
> https://github.com/apache/spark/pull/29205 for the analysis.






[jira] [Updated] (SPARK-32408) Enable crossPaths back to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Summary: Enable crossPaths back to prevent side effects  (was: Disable 
crossPaths only in GitHub Actions to prevent side effects)

> Enable crossPaths back to prevent side effects
> --
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build so that the JUnit
> tests run properly per project.
> This is a correct change since we're not doing a cross build in SBT.
> Now the intermediate classes are placed without the Scala version directory,
> specifically in the SBT build.
> This seems to cause side effects in anything that depends on that path. See,
> for example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub
> Actions build for now.






[jira] [Created] (SPARK-32422) Don't skip pandas UDF tests in IntegratedUDFTestUtils

2020-07-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32422:


 Summary: Don't skip pandas UDF tests in IntegratedUDFTestUtils
 Key: SPARK-32422
 URL: https://issues.apache.org/jira/browse/SPARK-32422
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Currently, pandas UDF test cases are being skipped as below:

{code}
[info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF is skipped because 
pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! IGNORED !!!
[info] - udf/postgreSQL/udf-select_having.sql - Scala UDF (2 seconds, 327 
milliseconds)
[info] - udf/postgreSQL/udf-select_having.sql - Regular Python UDF (3 seconds, 
656 milliseconds)
[info] - udf/postgreSQL/udf-select_having.sql - Scalar Pandas UDF is skipped 
because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
IGNORED !!!
[info] - udf/postgreSQL/udf-select_implicit.sql - Scala UDF (6 seconds, 769 
milliseconds)
[info] - udf/postgreSQL/udf-select_implicit.sql - Regular Python UDF (10 
seconds, 487 milliseconds)
[info] - udf/postgreSQL/udf-select_implicit.sql - Scalar Pandas UDF is skipped 
because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
IGNORED !!!
[info] - udf/postgreSQL/udf-aggregates_part3.sql - Scala UDF (119 milliseconds)
[info] - udf/postgreSQL/udf-aggregates_part3.sql - Regular Python UDF (229 
milliseconds)
[info] - udf/postgreSQL/udf-aggregates_part3.sql - Scalar Pandas UDF is skipped 
because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
IGNORED !!!
[info] - udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF (2 seconds, 376 
milliseconds)
[info] - udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF (2 
seconds, 449 milliseconds)
[info] - udf/postgreSQL/udf-aggregates_part2.sql - Scalar Pandas UDF is skipped 
because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
IGNORED !!!
[info] - udf/postgreSQL/udf-aggregates_part1.sql - Scala UDF (3 seconds, 634 
milliseconds)
[info] - udf/postgreSQL/udf-aggregates_part1.sql - Regular Python UDF (5 
seconds, 899 milliseconds)
[info] - udf/postgreSQL/udf-aggregates_part1.sql - Scalar Pandas UDF is skipped 
because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
IGNORED !!!
{code}

in GitHub Actions. We should test them.
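For reference, a hypothetical minimal example (not the IntegratedUDFTestUtils code
itself) of the kind of scalar pandas UDF these suites register; it only runs when
pandas and pyarrow are installed alongside pyspark, which is why the cases above
are skipped when those packages are missing.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# A scalar pandas UDF operates on batches of pandas Series via Arrow.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(3).select(plus_one("id")).show()
{code}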






[jira] [Assigned] (SPARK-32422) Don't skip pandas UDF tests in IntegratedUDFTestUtils

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32422:


Assignee: (was: Apache Spark)

> Don't skip pandas UDF tests in IntegratedUDFTestUtils
> -
>
> Key: SPARK-32422
> URL: https://issues.apache.org/jira/browse/SPARK-32422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, pandas UDF test cases are being skipped as below:
> {code}
> [info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF is skipped because 
> pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! IGNORED 
> !!!
> [info] - udf/postgreSQL/udf-select_having.sql - Scala UDF (2 seconds, 327 
> milliseconds)
> [info] - udf/postgreSQL/udf-select_having.sql - Regular Python UDF (3 
> seconds, 656 milliseconds)
> [info] - udf/postgreSQL/udf-select_having.sql - Scalar Pandas UDF is skipped 
> because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
> IGNORED !!!
> [info] - udf/postgreSQL/udf-select_implicit.sql - Scala UDF (6 seconds, 769 
> milliseconds)
> [info] - udf/postgreSQL/udf-select_implicit.sql - Regular Python UDF (10 
> seconds, 487 milliseconds)
> [info] - udf/postgreSQL/udf-select_implicit.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Scala UDF (119 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Regular Python UDF (229 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF (2 seconds, 376 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF (2 
> seconds, 449 milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Scala UDF (3 seconds, 634 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Regular Python UDF (5 
> seconds, 899 milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> {code}
> in GitHub Actions. We should test them.






[jira] [Assigned] (SPARK-32422) Don't skip pandas UDF tests in IntegratedUDFTestUtils

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32422:


Assignee: Apache Spark

> Don't skip pandas UDF tests in IntegratedUDFTestUtils
> -
>
> Key: SPARK-32422
> URL: https://issues.apache.org/jira/browse/SPARK-32422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, pandas UDF test cases are being skipped as below:
> {code}
> [info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF is skipped because 
> pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! IGNORED 
> !!!
> [info] - udf/postgreSQL/udf-select_having.sql - Scala UDF (2 seconds, 327 
> milliseconds)
> [info] - udf/postgreSQL/udf-select_having.sql - Regular Python UDF (3 
> seconds, 656 milliseconds)
> [info] - udf/postgreSQL/udf-select_having.sql - Scalar Pandas UDF is skipped 
> because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
> IGNORED !!!
> [info] - udf/postgreSQL/udf-select_implicit.sql - Scala UDF (6 seconds, 769 
> milliseconds)
> [info] - udf/postgreSQL/udf-select_implicit.sql - Regular Python UDF (10 
> seconds, 487 milliseconds)
> [info] - udf/postgreSQL/udf-select_implicit.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Scala UDF (119 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Regular Python UDF (229 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF (2 seconds, 376 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF (2 
> seconds, 449 milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Scala UDF (3 seconds, 634 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Regular Python UDF (5 
> seconds, 899 milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> {code}
> in GitHub Actions. We should test them.






[jira] [Commented] (SPARK-32422) Don't skip pandas UDF tests in IntegratedUDFTestUtils

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164125#comment-17164125
 ] 

Apache Spark commented on SPARK-32422:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29217

> Don't skip pandas UDF tests in IntegratedUDFTestUtils
> -
>
> Key: SPARK-32422
> URL: https://issues.apache.org/jira/browse/SPARK-32422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, pandas UDF test cases are being skipped as below:
> {code}
> [info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF is skipped because 
> pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! IGNORED 
> !!!
> [info] - udf/postgreSQL/udf-select_having.sql - Scala UDF (2 seconds, 327 
> milliseconds)
> [info] - udf/postgreSQL/udf-select_having.sql - Regular Python UDF (3 
> seconds, 656 milliseconds)
> [info] - udf/postgreSQL/udf-select_having.sql - Scalar Pandas UDF is skipped 
> because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! 
> IGNORED !!!
> [info] - udf/postgreSQL/udf-select_implicit.sql - Scala UDF (6 seconds, 769 
> milliseconds)
> [info] - udf/postgreSQL/udf-select_implicit.sql - Regular Python UDF (10 
> seconds, 487 milliseconds)
> [info] - udf/postgreSQL/udf-select_implicit.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Scala UDF (119 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Regular Python UDF (229 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part3.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF (2 seconds, 376 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF (2 
> seconds, 449 milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part2.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Scala UDF (3 seconds, 634 
> milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Regular Python UDF (5 
> seconds, 899 milliseconds)
> [info] - udf/postgreSQL/udf-aggregates_part1.sql - Scalar Pandas UDF is 
> skipped because pyspark,pandas and/or pyarrow were not available in 
> [python3.6]. !!! IGNORED !!!
> {code}
> in GitHub Actions. We should test them.






[jira] [Resolved] (SPARK-32237) Cannot resolve column when put hint in the views of common table expression

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32237.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29201
[https://github.com/apache/spark/pull/29201]

> Cannot resolve column when put hint in the views of common table expression
> ---
>
> Key: SPARK-32237
> URL: https://issues.apache.org/jira/browse/SPARK-32237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Hadoop-2.7.7
> Hive-2.3.6
> Spark-3.0.0
>Reporter: Kernel Force
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Suppose we have a table:
> {code:sql}
> CREATE TABLE DEMO_DATA (
>   ID VARCHAR(10),
>   NAME VARCHAR(10),
>   BATCH VARCHAR(10),
>   TEAM VARCHAR(1)
> ) STORED AS PARQUET;
> {code}
> and some data in it:
> {code:sql}
> 0: jdbc:hive2://HOSTNAME:1> SELECT T.* FROM DEMO_DATA T;
> +---+-+-+-+
> | t.id  | t.name  |   t.batch   | t.team  |
> +---+-+-+-+
> | 1 | mike| 2020-07-08  | A   |
> | 2 | john| 2020-07-07  | B   |
> | 3 | rose| 2020-07-06  | B   |
> | |
> +---+-+-+-+
> {code}
> If I put a query hint in VA or VB and run it in spark-shell:
> {code:sql}
> sql("""
> WITH VA AS
>  (SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
> FROM DEMO_DATA T WHERE T.TEAM = 'A'),
> VB AS
>  (SELECT /*+ REPARTITION(3) */ T.ID, T.NAME, T.BATCH, T.TEAM
> FROM VA T)
> SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
>   FROM VB T
> """).show
> {code}
> In Spark-2.4.4 it works fine.
> But in Spark-3.0.0, it throws an AnalysisException with an "Unrecognized hint"
> warning:
> {code:scala}
> 20/07/09 13:51:14 WARN analysis.HintErrorLogger: Unrecognized hint: 
> REPARTITION(3)
> org.apache.spark.sql.AnalysisException: cannot resolve '`T.ID`' given input 
> columns: [T.BATCH, T.ID, T.NAME, T.TEAM]; line 8 pos 7;
> 'Project ['T.ID, 'T.NAME, 'T.BATCH, 'T.TEAM]
> +- SubqueryAlias T
>+- SubqueryAlias VB
>   +- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>  +- SubqueryAlias T
> +- SubqueryAlias VA
>+- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>   +- Filter (TEAM#3 = A)
>  +- SubqueryAlias T
> +- SubqueryAlias spark_catalog.default.demo_data
>+- Relation[ID#0,NAME#1,BATCH#2,TEAM#3] parquet
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:129)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:134)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:106)
>   at 
> 

[jira] [Assigned] (SPARK-32237) Cannot resolve column when put hint in the views of common table expression

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32237:
---

Assignee: Lantao Jin

> Cannot resolve column when put hint in the views of common table expression
> ---
>
> Key: SPARK-32237
> URL: https://issues.apache.org/jira/browse/SPARK-32237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Hadoop-2.7.7
> Hive-2.3.6
> Spark-3.0.0
>Reporter: Kernel Force
>Assignee: Lantao Jin
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Suppose we have a table:
> {code:sql}
> CREATE TABLE DEMO_DATA (
>   ID VARCHAR(10),
>   NAME VARCHAR(10),
>   BATCH VARCHAR(10),
>   TEAM VARCHAR(1)
> ) STORED AS PARQUET;
> {code}
> and some data in it:
> {code:sql}
> 0: jdbc:hive2://HOSTNAME:1> SELECT T.* FROM DEMO_DATA T;
> +---+-+-+-+
> | t.id  | t.name  |   t.batch   | t.team  |
> +---+-+-+-+
> | 1 | mike| 2020-07-08  | A   |
> | 2 | john| 2020-07-07  | B   |
> | 3 | rose| 2020-07-06  | B   |
> | |
> +---+-+-+-+
> {code}
> If I put a query hint in VA or VB and run it in spark-shell:
> {code:sql}
> sql("""
> WITH VA AS
>  (SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
> FROM DEMO_DATA T WHERE T.TEAM = 'A'),
> VB AS
>  (SELECT /*+ REPARTITION(3) */ T.ID, T.NAME, T.BATCH, T.TEAM
> FROM VA T)
> SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
>   FROM VB T
> """).show
> {code}
> In Spark-2.4.4 it works fine.
> But in Spark-3.0.0, it throws an AnalysisException with an "Unrecognized hint"
> warning:
> {code:scala}
> 20/07/09 13:51:14 WARN analysis.HintErrorLogger: Unrecognized hint: 
> REPARTITION(3)
> org.apache.spark.sql.AnalysisException: cannot resolve '`T.ID`' given input 
> columns: [T.BATCH, T.ID, T.NAME, T.TEAM]; line 8 pos 7;
> 'Project ['T.ID, 'T.NAME, 'T.BATCH, 'T.TEAM]
> +- SubqueryAlias T
>+- SubqueryAlias VB
>   +- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>  +- SubqueryAlias T
> +- SubqueryAlias VA
>+- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>   +- Filter (TEAM#3 = A)
>  +- SubqueryAlias T
> +- SubqueryAlias spark_catalog.default.demo_data
>+- Relation[ID#0,NAME#1,BATCH#2,TEAM#3] parquet
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:129)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:134)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:140)
>   at 
> 

[jira] [Assigned] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32420:


Assignee: (was: Apache Spark)

> Add handling for unique key in non-codegen hash join
> 
>
> Key: SPARK-32420
> URL: https://issues.apache.org/jira/browse/SPARK-32420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> `HashedRelation` has two separate code paths for unique-key lookup and
> non-unique-key lookup. E.g. in its subclass
> `UnsafeHashedRelation` ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
> unique-key lookup is more efficient because it does not have the extra
> `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
> `BroadcastHashJoinExec` already handles unique keys and non-unique keys
> separately in the code-gen path
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
> But the non-codegen paths for broadcast hash join and shuffled hash join do not
> separate them yet, so this adds that support.






[jira] [Assigned] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32420:


Assignee: Apache Spark

> Add handling for unique key in non-codegen hash join
> 
>
> Key: SPARK-32420
> URL: https://issues.apache.org/jira/browse/SPARK-32420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> `HashedRelation` has two separate code paths for unique-key lookup and
> non-unique-key lookup. E.g. in its subclass
> `UnsafeHashedRelation` ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
> unique-key lookup is more efficient because it does not have the extra
> `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
> `BroadcastHashJoinExec` already handles unique keys and non-unique keys
> separately in the code-gen path
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
> But the non-codegen paths for broadcast hash join and shuffled hash join do not
> separate them yet, so this adds that support.






[jira] [Commented] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164113#comment-17164113
 ] 

Apache Spark commented on SPARK-32420:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29216

> Add handling for unique key in non-codegen hash join
> 
>
> Key: SPARK-32420
> URL: https://issues.apache.org/jira/browse/SPARK-32420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> `HashedRelation` has two separate code paths for unique-key lookup and
> non-unique-key lookup. E.g. in its subclass
> `UnsafeHashedRelation` ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
> unique-key lookup is more efficient because it does not have the extra
> `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
> `BroadcastHashJoinExec` already handles unique keys and non-unique keys
> separately in the code-gen path
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
> But the non-codegen paths for broadcast hash join and shuffled hash join do not
> separate them yet, so this adds that support.






[jira] [Created] (SPARK-32421) Add code-gen for shuffled hash join

2020-07-23 Thread Cheng Su (Jira)
Cheng Su created SPARK-32421:


 Summary: Add code-gen for shuffled hash join
 Key: SPARK-32421
 URL: https://issues.apache.org/jira/browse/SPARK-32421
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


We added shuffled hash join codegen internally in our fork, and we are seeing an
obvious improvement in benchmarks compared to the current non-codegen code path.
Creating this Jira to add that support. Shuffled hash join codegen is very
similar to broadcast hash join codegen, so this is a simple change.






[jira] [Commented] (SPARK-32421) Add code-gen for shuffled hash join

2020-07-23 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164108#comment-17164108
 ] 

Cheng Su commented on SPARK-32421:
--

Will raise a PR in a couple of days.

> Add code-gen for shuffled hash join
> ---
>
> Key: SPARK-32421
> URL: https://issues.apache.org/jira/browse/SPARK-32421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> We added shuffled hash join codegen internally in our fork, and we are seeing
> an obvious improvement in benchmarks compared to the current non-codegen code
> path. Creating this Jira to add that support. Shuffled hash join codegen is
> very similar to broadcast hash join codegen, so this is a simple change.






[jira] [Created] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-23 Thread Cheng Su (Jira)
Cheng Su created SPARK-32420:


 Summary: Add handling for unique key in non-codegen hash join
 Key: SPARK-32420
 URL: https://issues.apache.org/jira/browse/SPARK-32420
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


`HashedRelation` has two separate code paths for unique-key lookup and
non-unique-key lookup. E.g. in its subclass
`UnsafeHashedRelation` ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
unique-key lookup is more efficient because it does not have the extra
`Iterator[UnsafeRow].hasNext()/next()` overhead per row.

`BroadcastHashJoinExec` already handles unique keys and non-unique keys separately
in the code-gen path
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
But the non-codegen paths for broadcast hash join and shuffled hash join do not
separate them yet, so this adds that support.
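As a purely illustrative sketch (in Python, not Spark's Scala implementation) of
why the unique-key path is cheaper: with a unique key the build side maps each key
to a single row, so a probe is one lookup with no iterator to create or advance.

{code:python}
# Hypothetical build sides: unique key -> single row vs. key -> list of rows.
unique_side = {1: ("a",), 2: ("b",)}
multi_side = {1: [("a",)], 2: [("b",), ("b2",)]}

probe_key = 2

# Unique-key lookup: a single get(), no per-row iterator overhead.
row = unique_side.get(probe_key)
if row is not None:
    print("match:", row)

# Non-unique lookup: every probe pays for creating and advancing an iterator,
# even when there is exactly one matching row.
for row in multi_side.get(probe_key, []):
    print("match:", row)
{code}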






[jira] [Updated] (SPARK-31547) Upgrade Genjavadoc to 0.16

2020-07-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31547:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade Genjavadoc to 0.16
> --
>
> Key: SPARK-31547
> URL: https://issues.apache.org/jira/browse/SPARK-31547
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164102#comment-17164102
 ] 

Apache Spark commented on SPARK-32363:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29215

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently the pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}






[jira] [Commented] (SPARK-32264) More resources in Github Actions

2020-07-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164097#comment-17164097
 ] 

Dongjoon Hyun commented on SPARK-32264:
---

Thank you, [~holden] and [~hyukjin.kwon].

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Holden Karau
>Priority: Major
>
> We are currently using the free version of GitHub Actions, which only allows 20
> concurrent jobs. This is not enough for the heavy development of Apache Spark.
> We should have a way to allocate more resources.






[jira] [Commented] (SPARK-31525) Inconsistent result of df.head(1) and df.head()

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164086#comment-17164086
 ] 

Apache Spark commented on SPARK-31525:
--

User 'tianshizz' has created a pull request for this issue:
https://github.com/apache/spark/pull/29214

> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Joshua Hendinata
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In this line
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339],
> if you are calling `df.head()` and the dataframe is empty, it will return *None*,
> but if you are calling `df.head(1)` and the dataframe is empty, it will return an
> *empty list* instead.
> This particular behaviour is not consistent and can create confusion, especially
> when you are calling `len(df.head())`, which will throw an exception for an
> empty dataframe.
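A minimal sketch reproducing the reported behaviour (assuming a local
SparkSession):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
empty = spark.range(1).filter("id < 0")  # an empty DataFrame

print(empty.head())   # None
print(empty.head(1))  # [] -- len(empty.head(1)) works, len(empty.head()) raises
{code}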






[jira] [Assigned] (SPARK-31525) Inconsistent result of df.head(1) and df.head()

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31525:


Assignee: (was: Apache Spark)

> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Joshua Hendinata
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In this line
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339],
> if you are calling `df.head()` and the dataframe is empty, it will return *None*,
> but if you are calling `df.head(1)` and the dataframe is empty, it will return an
> *empty list* instead.
> This particular behaviour is not consistent and can create confusion, especially
> when you are calling `len(df.head())`, which will throw an exception for an
> empty dataframe.






[jira] [Commented] (SPARK-31525) Inconsistent result of df.head(1) and df.head()

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164084#comment-17164084
 ] 

Apache Spark commented on SPARK-31525:
--

User 'tianshizz' has created a pull request for this issue:
https://github.com/apache/spark/pull/29214

> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Joshua Hendinata
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In this line
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339],
> if you are calling `df.head()` and the dataframe is empty, it will return *None*,
> but if you are calling `df.head(1)` and the dataframe is empty, it will return an
> *empty list* instead.
> This particular behaviour is not consistent and can create confusion, especially
> when you are calling `len(df.head())`, which will throw an exception for an
> empty dataframe.






[jira] [Assigned] (SPARK-31525) Inconsistent result of df.head(1) and df.head()

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31525:


Assignee: Apache Spark

> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Joshua Hendinata
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In this line
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339],
> if you are calling `df.head()` and the dataframe is empty, it will return *None*,
> but if you are calling `df.head(1)` and the dataframe is empty, it will return an
> *empty list* instead.
> This particular behaviour is not consistent and can create confusion, especially
> when you are calling `len(df.head())`, which will throw an exception for an
> empty dataframe.






[jira] [Commented] (SPARK-14820) Reduce shuffle data by pushing filter toward storage

2020-07-23 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164074#comment-17164074
 ] 

Yuming Wang commented on SPARK-14820:
-

It seems the issue was fixed by SPARK-31705.

> Reduce shuffle data by pushing filter toward storage
> 
>
> Key: SPARK-14820
> URL: https://issues.apache.org/jira/browse/SPARK-14820
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ali Tootoonchian
>Priority: Trivial
>  Labels: bulk-closed
> Attachments: Reduce Shuffle Data by pushing filter toward storage.pdf
>
>
> The SQL query planner can be made smart enough to push filter operations down 
> towards the storage layer. If we optimize the query planner so that IO to 
> storage is reduced at the cost of running additional filters (i.e., compute), 
> this should be desirable when the system is IO bound.
> A supporting analysis and example are attached.
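As a rough sketch of the general idea (hypothetical code, not the attached
analysis; the path and column names are invented):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Filtering before the aggregation lets the planner push the predicate down to
# the Parquet scan, so less data is read from storage and shuffled.
events = spark.read.parquet("/data/events")           # illustrative path
ok_events = events.filter(events["status"] == "ok")   # candidate for pushdown
counts = ok_events.groupBy("user_id").count()

# The scan node in the physical plan should list the predicate under PushedFilters.
counts.explain()
{code}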



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32264) More resources in Github Actions

2020-07-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164070#comment-17164070
 ] 

Hyukjin Kwon commented on SPARK-32264:
--

Thank you [~holden]!

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Holden Karau
>Priority: Major
>
> We are currently using the free tier of GitHub Actions, which only allows 20 
> concurrent jobs. This is not enough for the heavy development activity in Apache Spark.
> We should have a way to allocate more resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32419) Leverage Conda environment at pip packaging test in GitHub Actions

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32419:


Assignee: Apache Spark

> Leverage Conda environment at pip packaging test in GitHub Actions
> --
>
> Key: SPARK-32419
> URL: https://issues.apache.org/jira/browse/SPARK-32419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> If you take a close look at the GitHub Actions log:
> {code:java}
>  Installing dist into virtual env
> Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Using legacy setup.py install for pyspark, since package 'wheel' is not 
> installed.
> Installing collected packages: py4j, pyspark
>  Running setup.py install for pyspark: started
>  Running setup.py install for pyspark: finished with status 'done'
> Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0
> ...
> Installing dist into virtual env
> Obtaining file:///home/runner/work/spark/spark/python
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Installing collected packages: py4j, pyspark
>  Attempting uninstall: py4j
>  Found existing installation: py4j 0.10.9
>  Uninstalling py4j-0.10.9:
>  Successfully uninstalled py4j-0.10.9
>  Attempting uninstall: pyspark
>  Found existing installation: pyspark 3.1.0.dev0
>  Uninstalling pyspark-3.1.0.dev0:
>  Successfully uninstalled pyspark-3.1.0.dev0
>  Running setup.py develop for pyspark
> Successfully installed py4j-0.10.9 pyspark
> {code}
> It looks like conda is not being used properly, as the package is removed and re-installed again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32419) Leverage Conda environment at pip packaging test in GitHub Actions

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32419:


Assignee: (was: Apache Spark)

> Leverage Conda environment at pip packaging test in GitHub Actions
> --
>
> Key: SPARK-32419
> URL: https://issues.apache.org/jira/browse/SPARK-32419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> If you take a close look at the GitHub Actions log:
> {code:java}
>  Installing dist into virtual env
> Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Using legacy setup.py install for pyspark, since package 'wheel' is not 
> installed.
> Installing collected packages: py4j, pyspark
>  Running setup.py install for pyspark: started
>  Running setup.py install for pyspark: finished with status 'done'
> Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0
> ...
> Installing dist into virtual env
> Obtaining file:///home/runner/work/spark/spark/python
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Installing collected packages: py4j, pyspark
>  Attempting uninstall: py4j
>  Found existing installation: py4j 0.10.9
>  Uninstalling py4j-0.10.9:
>  Successfully uninstalled py4j-0.10.9
>  Attempting uninstall: pyspark
>  Found existing installation: pyspark 3.1.0.dev0
>  Uninstalling pyspark-3.1.0.dev0:
>  Successfully uninstalled pyspark-3.1.0.dev0
>  Running setup.py develop for pyspark
> Successfully installed py4j-0.10.9 pyspark
> {code}
> It looks like conda is not being used properly, as the package is removed and re-installed again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32419) Leverage Conda environment at pip packaging test in GitHub Actions

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164067#comment-17164067
 ] 

Apache Spark commented on SPARK-32419:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29212

> Leverage Conda environment at pip packaging test in GitHub Actions
> --
>
> Key: SPARK-32419
> URL: https://issues.apache.org/jira/browse/SPARK-32419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> If you take a close look at the GitHub Actions log:
> {code:java}
>  Installing dist into virtual env
> Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Using legacy setup.py install for pyspark, since package 'wheel' is not 
> installed.
> Installing collected packages: py4j, pyspark
>  Running setup.py install for pyspark: started
>  Running setup.py install for pyspark: finished with status 'done'
> Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0
> ...
> Installing dist into virtual env
> Obtaining file:///home/runner/work/spark/spark/python
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Installing collected packages: py4j, pyspark
>  Attempting uninstall: py4j
>  Found existing installation: py4j 0.10.9
>  Uninstalling py4j-0.10.9:
>  Successfully uninstalled py4j-0.10.9
>  Attempting uninstall: pyspark
>  Found existing installation: pyspark 3.1.0.dev0
>  Uninstalling pyspark-3.1.0.dev0:
>  Successfully uninstalled pyspark-3.1.0.dev0
>  Running setup.py develop for pyspark
> Successfully installed py4j-0.10.9 pyspark
> {code}
> It looks like conda is not being used properly, as the package is removed and re-installed again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32264) More resources in Github Actions

2020-07-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164066#comment-17164066
 ] 

Holden Karau commented on SPARK-32264:
--

It's being routed internally at GitHub as of my last contact with them (9 days 
ago). I'll follow up at the end of the month if we don't hear back.

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Holden Karau
>Priority: Major
>
> We are currently using the free tier of GitHub Actions, which only allows 20 
> concurrent jobs. This is not enough for the heavy development activity in Apache Spark.
> We should have a way to allocate more resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32419) Leverage Conda environment at pip packaging test in GitHub Actions

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32419:
-
Description: 
If you take a close look at the GitHub Actions log:
{code:java}
 Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Using legacy setup.py install for pyspark, since package 'wheel' is not 
installed.
Installing collected packages: py4j, pyspark
 Running setup.py install for pyspark: started
 Running setup.py install for pyspark: finished with status 'done'
Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0

...

Installing dist into virtual env
Obtaining file:///home/runner/work/spark/spark/python
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
 Attempting uninstall: py4j
 Found existing installation: py4j 0.10.9
 Uninstalling py4j-0.10.9:
 Successfully uninstalled py4j-0.10.9
 Attempting uninstall: pyspark
 Found existing installation: pyspark 3.1.0.dev0
 Uninstalling pyspark-3.1.0.dev0:
 Successfully uninstalled pyspark-3.1.0.dev0
 Running setup.py develop for pyspark
Successfully installed py4j-0.10.9 pyspark
{code}

It looks like conda is not being used properly, as the package is removed and re-installed again.

  was:
If you take a close look for GitHub Actions log
{code:java}
 Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Using legacy setup.py install for pyspark, since package 'wheel' is not 
installed.
Installing collected packages: py4j, pyspark
 Running setup.py install for pyspark: started
 Running setup.py install for pyspark: finished with status 'done'
Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0
...
Installing dist into virtual env
Obtaining file:///home/runner/work/spark/spark/python
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
 Attempting uninstall: py4j
 Found existing installation: py4j 0.10.9
 Uninstalling py4j-0.10.9:
 Successfully uninstalled py4j-0.10.9
 Attempting uninstall: pyspark
 Found existing installation: pyspark 3.1.0.dev0
 Uninstalling pyspark-3.1.0.dev0:
 Successfully uninstalled pyspark-3.1.0.dev0
 Running setup.py develop for pyspark
Successfully installed py4j-0.10.9 pyspark{code}


> Leverage Conda environment at pip packaging test in GitHub Actions
> --
>
> Key: SPARK-32419
> URL: https://issues.apache.org/jira/browse/SPARK-32419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> If you take a close look at the GitHub Actions log:
> {code:java}
>  Installing dist into virtual env
> Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Using legacy setup.py install for pyspark, since package 'wheel' is not 
> installed.
> Installing collected packages: py4j, pyspark
>  Running setup.py install for pyspark: started
>  Running setup.py install for pyspark: finished with status 'done'
> Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0
> ...
> Installing dist into virtual env
> Obtaining file:///home/runner/work/spark/spark/python
> Collecting py4j==0.10.9
>  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
> Installing collected packages: py4j, pyspark
>  Attempting uninstall: py4j
>  Found existing installation: py4j 0.10.9
>  Uninstalling py4j-0.10.9:
>  Successfully uninstalled py4j-0.10.9
>  Attempting uninstall: pyspark
>  Found existing installation: pyspark 3.1.0.dev0
>  Uninstalling pyspark-3.1.0.dev0:
>  Successfully uninstalled pyspark-3.1.0.dev0
>  Running setup.py develop for pyspark
> Successfully installed py4j-0.10.9 pyspark
> {code}
> It looks like conda is not being used properly, as the package is removed and re-installed again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32419) Leverage Conda environment at pip packaging test in GitHub Actions

2020-07-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32419:


 Summary: Leverage Conda environment at pip packaging test in 
GitHub Actions
 Key: SPARK-32419
 URL: https://issues.apache.org/jira/browse/SPARK-32419
 Project: Spark
  Issue Type: Sub-task
  Components: Build, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


If you take a close look at the GitHub Actions log:
{code:java}
 Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Using legacy setup.py install for pyspark, since package 'wheel' is not 
installed.
Installing collected packages: py4j, pyspark
 Running setup.py install for pyspark: started
 Running setup.py install for pyspark: finished with status 'done'
Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0
...
Installing dist into virtual env
Obtaining file:///home/runner/work/spark/spark/python
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
 Attempting uninstall: py4j
 Found existing installation: py4j 0.10.9
 Uninstalling py4j-0.10.9:
 Successfully uninstalled py4j-0.10.9
 Attempting uninstall: pyspark
 Found existing installation: pyspark 3.1.0.dev0
 Uninstalling pyspark-3.1.0.dev0:
 Successfully uninstalled pyspark-3.1.0.dev0
 Running setup.py develop for pyspark
Successfully installed py4j-0.10.9 pyspark{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32264) More resources in Github Actions

2020-07-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164062#comment-17164062
 ] 

Hyukjin Kwon commented on SPARK-32264:
--

I'll assign this to [~holden] for now, since she's the contact point.

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Holden Karau
>Priority: Major
>
> We are currently using the free tier of GitHub Actions, which only allows 20 
> concurrent jobs. This is not enough for the heavy development activity in Apache Spark.
> We should have a way to allocate more resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32264) More resources in Github Actions

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32264:


Assignee: Holden Karau

> More resources in Github Actions
> 
>
> Key: SPARK-32264
> URL: https://issues.apache.org/jira/browse/SPARK-32264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Holden Karau
>Priority: Major
>
> We are currently using the free tier of GitHub Actions, which only allows 20 
> concurrent jobs. This is not enough for the heavy development activity in Apache Spark.
> We should have a way to allocate more resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32415) Enable JSON tests for the allowNonNumericNumbers option

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32415.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29207
[https://github.com/apache/spark/pull/29207]

> Enable JSON tests for the allowNonNumericNumbers option
> ---
>
> Key: SPARK-32415
> URL: https://issues.apache.org/jira/browse/SPARK-32415
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
> option are ignored. The tests can be enabled.
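For context, a rough sketch of what the option controls (a hypothetical example,
not one of the ignored tests):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("value", DoubleType())])
records = spark.sparkContext.parallelize(['{"value": NaN}', '{"value": Infinity}'])

# With allowNonNumericNumbers enabled, the JSON parser accepts non-numeric
# tokens such as NaN and Infinity for floating-point columns; with it disabled,
# such records fail to parse.
df = (spark.read
      .schema(schema)
      .option("allowNonNumericNumbers", "true")
      .json(records))
df.show()
{code}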



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32415) Enable JSON tests for the allowNonNumericNumbers option

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32415:


Assignee: Maxim Gekk

> Enable JSON tests for the allowNonNumericNumbers option
> ---
>
> Key: SPARK-32415
> URL: https://issues.apache.org/jira/browse/SPARK-32415
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
> option are ignored. The tests can be enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32398) Upgrade to scalatest 3.2.0 for Scala 2.13.3 compatibility

2020-07-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32398.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29196
[https://github.com/apache/spark/pull/29196]

> Upgrade to scalatest 3.2.0 for Scala 2.13.3 compatibility
> -
>
> Key: SPARK-32398
> URL: https://issues.apache.org/jira/browse/SPARK-32398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core, SQL, Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.1.0
>
>
> We'll need to update to scalatest 3.2.0 in order to pick up the fix here, 
> which fixes an incompatibility with Scala 2.13.3:
> https://github.com/scalatest/scalatest/commit/7c89416aa9f3e7f2730a343ad6d3bdcff65809de
> That's a big change, unfortunately - 3.1 / 3.2 reorganized many classes. 
> Fortunately it's mostly just import updates in about 100 files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32364) Use CaseInsensitiveMap for DataFrameReader/Writer options

2020-07-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32364:
--
Fix Version/s: 2.4.7

> Use CaseInsensitiveMap for DataFrameReader/Writer options
> -
>
> Key: SPARK-32364
> URL: https://issues.apache.org/jira/browse/SPARK-32364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Girish A Pandit
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> When a user has multiple options like path, paTH, and PATH for the same key 
> path, option/options is non-deterministic because extraOptions is a HashMap. 
> This issue aims to use *CaseInsensitiveMap* instead of *HashMap* to fix this 
> bug fundamentally.
> {code}
> spark.read
>   .option("paTh", "1")
>   .option("PATH", "2")
>   .option("Path", "3")
>   .option("patH", "4")
>   .load("5")
> ...
> org.apache.spark.sql.AnalysisException:
> Path does not exist: file:/.../1;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32418) Flaky test: org.apache.spark.DistributedSuite.caching in memory, serialized, replicated (encryption = off)

2020-07-23 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-32418:


 Summary: Flaky test: org.apache.spark.DistributedSuite.caching in 
memory, serialized, replicated (encryption = off)
 Key: SPARK-32418
 URL: https://issues.apache.org/jira/browse/SPARK-32418
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Jungtaek Lim


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126432/testReport/

{noformat}
org.apache.spark.DistributedSuite.caching in memory, serialized, replicated 
(encryption = off)
 Error Details
org.scalatest.exceptions.TestFailedException: 9 did not equal 10
 Stack Trace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 9 did not 
equal 10
at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
at 
org.apache.spark.DistributedSuite.testCaching(DistributedSuite.scala:181)
at 
org.apache.spark.DistributedSuite.$anonfun$testCaching$1(DistributedSuite.scala:162)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite.run(Suite.scala:1124)
at org.scalatest.Suite.run$(Suite.scala:1106)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:59)
at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2020-07-23 Thread Wing Yew Poon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163987#comment-17163987
 ] 

Wing Yew Poon commented on SPARK-31693:
---

I'm seeing a problem with the .m2 cache on amp-jenkins-worker-06. In 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126370/console
{noformat}
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-06 (centos spark-test) in workspace 
/home/jenkins/workspace/SparkPullRequestBuilder
...

Running build tests

exec: curl -s -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
exec: curl -s -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz
Using `mvn` from path: 
/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.3/bin/mvn
Using `mvn` from path: 
/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.3/bin/mvn
Performing Maven install for hadoop-2.7-hive-1.2
Using `mvn` from path: 
/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.3/bin/mvn
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-install-plugin:3.0.0-M1:install (default-cli) on 
project spark-yarn_2.12: ArtifactInstallerException: Failed to install metadata 
org.apache.spark:spark-yarn_2.12/maven-metadata.xml: Could not parse metadata 
/home/jenkins/.m2/repository/org/apache/spark/spark-yarn_2.12/maven-metadata-local.xml:
 in epilog non whitespace content is not allowed but got t (position: END_TAG 
seen ...\nt... @13:2) -> [Help 1]
{noformat}


> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download from the Maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163983#comment-17163983
 ] 

Apache Spark commented on SPARK-31197:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29211

> Exit the executor once all tasks & migrations are finished
> --
>
> Key: SPARK-31197
> URL: https://issues.apache.org/jira/browse/SPARK-31197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31197) Exit the executor once all tasks & migrations are finished

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163981#comment-17163981
 ] 

Apache Spark commented on SPARK-31197:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29211

> Exit the executor once all tasks & migrations are finished
> --
>
> Key: SPARK-31197
> URL: https://issues.apache.org/jira/browse/SPARK-31197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32054) Flaky test: org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet V2 to V1

2020-07-23 Thread Wing Yew Poon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163976#comment-17163976
 ] 

Wing Yew Poon commented on SPARK-32054:
---

org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet 
V2 to V1 failed in 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126425; 
however, earlier, it passed in 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126354/ for 
the same PR (no changes between the runs).

> Flaky test: 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.Fallback Parquet 
> V2 to V1
> --
>
> Key: SPARK-32054
> URL: https://issues.apache.org/jira/browse/SPARK-32054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124364/testReport/org.apache.spark.sql.connector/FileDataSourceV2FallBackSuite/Fallback_Parquet_V2_to_V1/
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedException: 
> ArrayBuffer((collect,Relation[id#387495L] parquet ), 
> (save,InsertIntoHadoopFsRelationCommand 
> file:/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776,
>  false, Parquet, Map(path -> 
> /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776),
>  ErrorIfExists, [id] +- Range (0, 10, step=1, splits=Some(2)) )) had length 2 
> instead of expected length 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> ArrayBuffer((collect,Relation[id#387495L] parquet
> ), (save,InsertIntoHadoopFsRelationCommand 
> file:/home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776,
>  false, Parquet, Map(path -> 
> /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-fe4d8028-b7c5-406d-9c5a-59c96e98f776),
>  ErrorIfExists, [id]
> +- Range (0, 10, step=1, splits=Some(2))
> )) had length 2 instead of expected length 1
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$22(FileDataSourceV2FallBackSuite.scala:180)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$22$adapted(FileDataSourceV2FallBackSuite.scala:176)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
>   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$21(FileDataSourceV2FallBackSuite.scala:176)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(FileDataSourceV2FallBackSuite.scala:85)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.withSQLConf(FileDataSourceV2FallBackSuite.scala:85)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$20(FileDataSourceV2FallBackSuite.scala:158)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$20$adapted(FileDataSourceV2FallBackSuite.scala:157)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.connector.FileDataSourceV2FallBackSuite.$anonfun$new$19(FileDataSourceV2FallBackSuite.scala:157)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at 

[jira] [Created] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor

2020-07-23 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32417:
-

 Summary: Flaky test: 
BlockManagerDecommissionIntegrationSuite.verify that an already running task 
which is going to cache data succeeds on a decommissioned executor
 Key: SPARK-32417
 URL: https://issues.apache.org/jira/browse/SPARK-32417
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 3.1.0
Reporter: Gabor Somogyi


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/
{code:java}
Error Message
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 2759 times over 30.001772248 
seconds. Last failure message: Map() was empty We should have a block that has 
been on multiple BMs in rdds:  
ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
replicas),56,0)), SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
localhost, 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 
replicas),56,0)), SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
localhost, 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))) from: 
ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
amp-jenkins-worker-05.amp, 45854, None),broadcast_1_piece0,StorageLevel(memory, 
1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 37968, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 42805, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 42041, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 37968, 
None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 42041, 
None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 37968, 
None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))  but 
instead we got:  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
Stacktrace
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 2759 times over 30.001772248 
seconds. Last failure message: Map() was empty We should have a block that has 
been on multiple BMs in rdds:
 ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
replicas),56,0)), SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
localhost, 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 
replicas),56,0)), SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
localhost, 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))) from:
ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
amp-jenkins-worker-05.amp, 45854, None),broadcast_1_piece0,StorageLevel(memory, 
1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 37968, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 42805, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 42041, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 37968, 
None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 42041, 
None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 37968, 
None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))
 but instead we got:
 Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
at 
org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
at 
org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:33)
at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:308)
at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:307)
at 

[jira] [Created] (SPARK-32416) Flaky test: SparkContextSuite.Cancelling stages/jobs with custom reasons

2020-07-23 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32416:
-

 Summary: Flaky test: SparkContextSuite.Cancelling stages/jobs with 
custom reasons
 Key: SPARK-32416
 URL: https://issues.apache.org/jira/browse/SPARK-32416
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 3.1.0
Reporter: Gabor Somogyi


[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/]
{code:java}
Error Message
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 1293 times over 20.01121311 
seconds. Last failure message: 1 did not equal 0.
Stacktrace
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 1293 times over 20.01121311 
seconds. Last failure message: 1 did not equal 0.
at 
org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
at 
org.apache.spark.SparkContextSuite.eventually(SparkContextSuite.scala:49)
at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
at 
org.apache.spark.SparkContextSuite.eventually(SparkContextSuite.scala:49)
at 
org.apache.spark.SparkContextSuite.$anonfun$new$58(SparkContextSuite.scala:607)
at 
org.apache.spark.SparkContextSuite.$anonfun$new$58$adapted(SparkContextSuite.scala:566)
at scala.collection.immutable.List.foreach(List.scala:392)
at 
org.apache.spark.SparkContextSuite.$anonfun$new$57(SparkContextSuite.scala:566)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite.run(Suite.scala:1124)
at org.scalatest.Suite.run$(Suite.scala:1106)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:59)
at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
at 

[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163864#comment-17163864
 ] 

Apache Spark commented on SPARK-24497:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29210

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --  | |
> --  | +->D-+->F
> --  +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32364) Use CaseInsensitiveMap for DataFrameReader/Writer options

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163857#comment-17163857
 ] 

Apache Spark commented on SPARK-32364:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29209

> Use CaseInsensitiveMap for DataFrameReader/Writer options
> -
>
> Key: SPARK-32364
> URL: https://issues.apache.org/jira/browse/SPARK-32364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Girish A Pandit
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> When a user has multiple options like path, paTH, and PATH for the same key 
> path, option/options is non-deterministic because extraOptions is a HashMap. 
> This issue aims to use *CaseInsensitiveMap* instead of *HashMap* to fix this 
> bug fundamentally.
> {code}
> spark.read
>   .option("paTh", "1")
>   .option("PATH", "2")
>   .option("Path", "3")
>   .option("patH", "4")
>   .load("5")
> ...
> org.apache.spark.sql.AnalysisException:
> Path does not exist: file:/.../1;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17053) Spark ignores hive.exec.drop.ignorenonexistent=true option

2020-07-23 Thread Jeffrey E Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163821#comment-17163821
 ] 

Jeffrey E  Rodriguez commented on SPARK-17053:
--

[~rxin] as esoteric as the "hive.exec.drop.ignorenonexistent" configuration may 
look, there is much code written by Hive developers that would not work in 
Spark. This will inhibit migration to Spark using Spark SQL. Given the choice of 
having to touch their Hive code in order to move to Spark, some Hive developers 
would choose not to touch their code and stay with Hive. 

[~dongjoon]'s fix looks good and fixes the issue; it is my opinion as an Apache 
committer that it should get a chance to make it in.

> Spark ignores hive.exec.drop.ignorenonexistent=true option
> --
>
> Key: SPARK-17053
> URL: https://issues.apache.org/jira/browse/SPARK-17053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gokhan Civan
>Priority: Major
>
> In version 1.6.1, the following does not throw an exception:
> create table a as select 1; drop table a; drop table a;
> In version 2.0.0, the second drop fails; this is not compatible with Hive.
> The same problem exists for views.
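For reference, a minimal reproduction sketch (hypothetical code using the modern
SparkSession API and assuming Hive support; where the Hive setting is supplied
is only illustrative):

{code:python}
from pyspark.sql import SparkSession

# The Hive setting could equally live in hive-site.xml; passing it here is
# only for illustration.
spark = (SparkSession.builder
         .enableHiveSupport()
         .config("hive.exec.drop.ignorenonexistent", "true")
         .getOrCreate())

spark.sql("CREATE TABLE a AS SELECT 1")
spark.sql("DROP TABLE a")
# Hive silently ignores dropping a missing table when the setting is true,
# but Spark 2.0.0 raises an error here regardless.
spark.sql("DROP TABLE a")
{code}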



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-07-23 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-31418.
---
Fix Version/s: 3.1.0
 Assignee: Venkata krishnan Sowrirajan
   Resolution: Fixed

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.1.0
>
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks whether there is 
> an idle blacklisted executor which can be killed and replaced; if not, it 
> aborts the job without doing the max number of retries.
> In the context of dynamic allocation this can be handled better: instead of killing 
> an idle blacklisted executor (it's possible there are no idle blacklisted 
> executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below, although this 
> example should eventually fail anyway; it just shows that the task is not retried 
> spark.task.maxFailures times: 
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-32413) Guidance for my project

2020-07-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler closed SPARK-32413.


> Guidance for my project 
> 
>
> Key: SPARK-32413
> URL: https://issues.apache.org/jira/browse/SPARK-32413
> Project: Spark
>  Issue Type: Brainstorming
>  Components: PySpark, Spark Core, SparkR
>Affects Versions: 3.0.0
>Reporter: Suat Toksoz
>Priority: Minor
>
> hi,
> I am planning to read an Elasticsearch index continuously, put that data 
> into a DataFrame, group it, search it, and create an alert. I would like to 
> write my code in Python.
> For this purpose, what should I use: Spark, Jupyter, PySpark... 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32413) Guidance for my project

2020-07-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32413.
--
Resolution: Not A Problem

Hi [~stoksoz], this type of discussion is more appropriate for the mailing 
list; see [https://spark.apache.org/community.html] for how to subscribe.

> Guidance for my project 
> 
>
> Key: SPARK-32413
> URL: https://issues.apache.org/jira/browse/SPARK-32413
> Project: Spark
>  Issue Type: Brainstorming
>  Components: PySpark, Spark Core, SparkR
>Affects Versions: 3.0.0
>Reporter: Suat Toksoz
>Priority: Minor
>
> hi,
> I am planning to read an Elasticsearch index continuously, put that data 
> into a DataFrame, group it, search it, and create an alert. I would like to 
> write my code in Python.
> For this purpose, what should I use: Spark, Jupyter, PySpark... 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32372) "Resolved attribute(s) XXX missing" after dudup conflict references

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163781#comment-17163781
 ] 

Apache Spark commented on SPARK-32372:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29208

> "Resolved attribute(s) XXX missing" after dudup conflict references
> ---
>
> Key: SPARK-32372
> URL: https://issues.apache.org/jira/browse/SPARK-32372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.4, 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> // case class Person(id: Int, name: String, age: Int)
> sql("SELECT name, avg(age) as avg_age FROM person GROUP BY 
> name").createOrReplaceTempView("person_a")
> sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = 
> p2.name").createOrReplaceTempView("person_b")
> sql("SELECT * FROM person_a UNION SELECT * FROM person_b")   
> .createOrReplaceTempView("person_c")
> sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name 
> = p2.name").show
> {code}
> error:
> {code:java}
> [info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: 
> Resolved attribute(s) avg_age#235 missing from name#233,avg_age#231 in 
> operator !Project [name#233, avg_age#235]. Attribute(s) with the same name 
> appear in the operation: avg_age. Please check if the right attribute(s) are 
> used.;;
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32280) AnalysisException thrown when query contains several JOINs

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163780#comment-17163780
 ] 

Apache Spark commented on SPARK-32280:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29208

> AnalysisException thrown when query contains several JOINs
> --
>
> Key: SPARK-32280
> URL: https://issues.apache.org/jira/browse/SPARK-32280
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: David Lindelöf
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> I've come across a curious {{AnalysisException}} thrown in one of my SQL 
> queries, even though the SQL appears legitimate. I was able to reduce it to 
> this example:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> spark.sql('SELECT 1 AS id').createOrReplaceTempView('A')
> spark.sql('''
>  SELECT id,
>  'foo' AS kind
>  FROM A''').createOrReplaceTempView('B')
> spark.sql('''
>  SELECT l.id
>  FROM B AS l
>  JOIN B AS r
>  ON l.kind = r.kind''').createOrReplaceTempView('C')
> spark.sql('''
>  SELECT 0
>  FROM (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  JOIN (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  USING (id)''')
> {code}
> Running this yields the following error:
> {code}
>  py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
> : org.apache.spark.sql.AnalysisException: Resolved attribute(s) kind#11 
> missing from id#10,kind#2,id#7,kind#5 in operator !Join Inner, (kind#11 = 
> kind#5). Attribute(s) with the same name appear in the operation: kind. 
> Please check if the right attribute(s) are used.;;
> Project [0 AS 0#15]
> +- Project [id#0, kind#2, kind#11]
>+- Join Inner, (id#0 = id#14)
>   :- SubqueryAlias `__auto_generated_subquery_name`
>   :  +- Project [id#0, kind#2]
>   : +- Project [id#0, kind#2]
>   :+- Join Inner, (id#0 = id#9)
>   :   :- SubqueryAlias `b`
>   :   :  +- Project [id#0, foo AS kind#2]
>   :   : +- SubqueryAlias `a`
>   :   :+- Project [1 AS id#0]
>   :   :   +- OneRowRelation
>   :   +- SubqueryAlias `c`
>   :  +- Project [id#9]
>   : +- Join Inner, (kind#2 = kind#5)
>   ::- SubqueryAlias `l`
>   ::  +- SubqueryAlias `b`
>   :: +- Project [id#9, foo AS kind#2]
>   ::+- SubqueryAlias `a`
>   ::   +- Project [1 AS id#9]
>   ::  +- OneRowRelation
>   :+- SubqueryAlias `r`
>   :   +- SubqueryAlias `b`
>   :  +- Project [id#7, foo AS kind#5]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#7]
>   :   +- OneRowRelation
>   +- SubqueryAlias `__auto_generated_subquery_name`
>  +- Project [id#14, kind#11]
> +- Project [id#14, kind#11]
>+- Join Inner, (id#14 = id#10)
>   :- SubqueryAlias `b`
>   :  +- Project [id#14, foo AS kind#11]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#14]
>   :   +- OneRowRelation
>   +- SubqueryAlias `c`
>  +- Project [id#10]
> +- !Join Inner, (kind#11 = kind#5)
>:- SubqueryAlias `l`
>:  +- SubqueryAlias `b`
>: +- Project [id#10, foo AS kind#2]
>:+- SubqueryAlias `a`
>:   +- Project [1 AS id#10]
>:  +- OneRowRelation
>+- SubqueryAlias `r`
>   +- SubqueryAlias `b`
>  +- Project [id#7, foo AS kind#5]
> +- SubqueryAlias `a`
>+- Project [1 AS id#7]
>   +- OneRowRelation
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:369)
>   at 
> 

[jira] [Commented] (SPARK-32411) GPU Cluster Fail

2020-07-23 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163742#comment-17163742
 ] 

L. C. Hsieh commented on SPARK-32411:
-

I think it is because of the configs.

"spark.task.resource.gpu.amount  2" means each task requires 2 GPUs, but 
"spark.executor.resource.gpu.amount 1" specifies that each executor has only 1 GPU. 
So the task scheduler cannot find an executor that meets the task requirement.
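
As a rough sketch (assuming the discovery script lives at 
/usr/local/spark/getGpusResources.sh), a consistent set of settings keeps the 
per-task amount at or below the per-executor amount:

{code:java}
spark.executor.resource.gpu.amount  1
spark.task.resource.gpu.amount      1
spark.executor.resource.gpu.discoveryScript /usr/local/spark/getGpusResources.sh
{code}

With these values each task asks for exactly the one GPU an executor advertises, 
so tasks can be scheduled; a fractional spark.task.resource.gpu.amount (e.g. 0.5) 
would instead let two tasks share a GPU.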

> GPU Cluster Fail
> 
>
> Key: SPARK-32411
> URL: https://issues.apache.org/jira/browse/SPARK-32411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 3.0.0
> Environment: I have an Apache Spark 3.0 cluster consisting of machines 
> with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster 
> using PySpark.
>Reporter: Vinh Tran
>Priority: Major
>
> I'm having a difficult time getting a GPU cluster started on Apache Spark 
> 3.0. It was hard to find documentation on this, but I stumbled on an NVIDIA 
> GitHub page for RAPIDS which suggested the following additional edits to 
> spark-defaults.conf:
> {code:java}
> spark.task.resource.gpu.amount 0.25
> spark.executor.resource.gpu.discoveryScript 
> ./usr/local/spark/getGpusResources.sh{code}
> I have an Apache Spark 3.0 cluster consisting of machines with multiple 
> NVIDIA GPUs and I connect my Jupyter notebook to the cluster using PySpark; 
> however, it results in the following error: 
> {code:java}
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: You must specify an amount for gpu
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
>   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
>   at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.(SparkContext.scala:528)
>   at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:238)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> After this, I tried adding another line to the conf per the instructions, 
> which results in no errors; however, when I log in to the Web UI at 
> localhost:8080, under Running Applications, the state remains at waiting.
> {code:java}
> spark.task.resource.gpu.amount  2
> spark.executor.resource.gpu.discoveryScript
> ./usr/local/spark/getGpusResources.sh
> spark.executor.resource.gpu.amount  1
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32411) GPU Cluster Fail

2020-07-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-32411.
-
Resolution: Not A Problem

> GPU Cluster Fail
> 
>
> Key: SPARK-32411
> URL: https://issues.apache.org/jira/browse/SPARK-32411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 3.0.0
> Environment: I have an Apache Spark 3.0 cluster consisting of machines 
> with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster 
> using PySpark.
>Reporter: Vinh Tran
>Priority: Major
>
> I'm having a difficult time getting a GPU cluster started on Apache Spark 
> 3.0. It was hard to find documentation on this, but I stumbled on an NVIDIA 
> GitHub page for RAPIDS which suggested the following additional edits to 
> spark-defaults.conf:
> {code:java}
> spark.task.resource.gpu.amount 0.25
> spark.executor.resource.gpu.discoveryScript 
> ./usr/local/spark/getGpusResources.sh{code}
> I have an Apache Spark 3.0 cluster consisting of machines with multiple 
> NVIDIA GPUs and I connect my Jupyter notebook to the cluster using PySpark; 
> however, it results in the following error: 
> {code:java}
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: You must specify an amount for gpu
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
>   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
>   at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.(SparkContext.scala:528)
>   at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:238)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> After this, I tried adding another line to the conf per the instructions, 
> which results in no errors; however, when I log in to the Web UI at 
> localhost:8080, under Running Applications, the state remains at waiting.
> {code:java}
> spark.task.resource.gpu.amount  2
> spark.executor.resource.gpu.discoveryScript
> ./usr/local/spark/getGpusResources.sh
> spark.executor.resource.gpu.amount  1
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32415) Enable JSON tests for the allowNonNumericNumbers option

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32415:


Assignee: Apache Spark

> Enable JSON tests for the allowNonNumericNumbers option
> ---
>
> Key: SPARK-32415
> URL: https://issues.apache.org/jira/browse/SPARK-32415
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
> option are ignored. The tests can be enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32415) Enable JSON tests for the allowNonNumericNumbers option

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32415:


Assignee: (was: Apache Spark)

> Enable JSON tests for the allowNonNumericNumbers option
> ---
>
> Key: SPARK-32415
> URL: https://issues.apache.org/jira/browse/SPARK-32415
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
> option are ignored. The tests can be enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32415) Enable JSON tests for the allowNonNumericNumbers option

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163720#comment-17163720
 ] 

Apache Spark commented on SPARK-32415:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29207

> Enable JSON tests for the allowNonNumericNumbers option
> ---
>
> Key: SPARK-32415
> URL: https://issues.apache.org/jira/browse/SPARK-32415
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
> option are ignored. The tests can be enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32415) Enable JSON tests for the allowNonNumericNumbers option

2020-07-23 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-32415:
---
Summary: Enable JSON tests for the allowNonNumericNumbers option  (was: 
Enable JSON tests from the allowNonNumericNumbers option)

> Enable JSON tests for the allowNonNumericNumbers option
> ---
>
> Key: SPARK-32415
> URL: https://issues.apache.org/jira/browse/SPARK-32415
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
> option are ignored. The tests can be enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32415) Enable JSON tests from the allowNonNumericNumbers option

2020-07-23 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32415:
--

 Summary: Enable JSON tests from the allowNonNumericNumbers option
 Key: SPARK-32415
 URL: https://issues.apache.org/jira/browse/SPARK-32415
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, 2 tests in JsonParsingOptionsSuite for the allowNonNumericNumbers 
option are ignored. The tests can be enabled.
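
For reference, a minimal sketch (spark-shell style, with a hypothetical "value" 
column) of what the option controls: with allowNonNumericNumbers enabled, the 
JSON parser accepts non-standard numeric tokens such as NaN and Infinity.

{code:java}
import spark.implicits._

val ds = Seq("""{"value": NaN}""", """{"value": Infinity}""").toDS()
val df = spark.read
  .schema("value DOUBLE")
  // default is true; setting it to false turns such records into malformed rows
  .option("allowNonNumericNumbers", "true")
  .json(ds)
df.show()
{code}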



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32414) pyspark crashes in cluster mode with kafka structured streaming

2020-07-23 Thread cyrille cazenave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cyrille cazenave updated SPARK-32414:
-
Attachment: spark.py

> pyspark crashes in cluster mode with kafka structured streaming
> ---
>
> Key: SPARK-32414
> URL: https://issues.apache.org/jira/browse/SPARK-32414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: * spark version 3.0.0 from mac brew
>  * kubernetes Kind 18+
>  * kafka cluster: strimzi/kafka:0.18.0-kafka-2.5.0
>  * kafka package: org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0
>Reporter: cyrille cazenave
>Priority: Major
> Attachments: fulllogs.txt, spark.py
>
>
> Hello,
> {{I have been trying to run a pyspark script on Spark on Kubernetes and I 
> have this error that crashed the application:}}
> {{java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD)}}
>  
> I followed those steps:
>  * for spark on kubernetes: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that 
> include building the image using docker-image-tool.sh on mac with -p flag)
>  * Tried to use the image by the dev on 
> GoogleCloudPlatform/spark-on-k8s-operator 
> (gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
>  * for kafka streaming: 
> [https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
>  * {{When running the script manually in a jupyter notebook 
> (jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
> PYSPARK_SUBMIT_ARGS=--packages 
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell) it ran 
> without issue}}
>  * the command ran from the laptop is:
> spark-submit --master 
> k8s://[https://127.0.0.1:53979|https://127.0.0.1:53979/] --name spark-pi 
> --deploy-mode cluster --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
> spark.kubernetes.container.image=fifoosab/pytest:3.0.0.dev0 --conf 
> spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
> spark.kubernetes.executor.request.cores=1 --conf 
> spark.kubernetes.driver.request.cores=1 --conf 
> spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py
>  
> {{full logs on the error in the attachments}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32414) pyspark crashes in cluster mode with kafka structured streaming

2020-07-23 Thread cyrille cazenave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cyrille cazenave updated SPARK-32414:
-
Description: 
Hello,

{{I have been trying to run a pyspark script on Spark on Kubernetes and I have 
this error that crashed the application:}}

{{java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD)}}

 

I followed those steps:
 * for spark on kubernetes: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that include 
building the image using docker-image-tool.sh on mac with -p flag)
 * Tried to use the image by the dev on 
GoogleCloudPlatform/spark-on-k8s-operator 
(gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
 * for kafka streaming: 
[https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
 * {{When running the script manually in a jupyter notebook 
(jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 
pyspark-shell) it ran without issue}}
 * the command ran from the laptop is:

spark-submit --master k8s://[https://127.0.0.1:53979|https://127.0.0.1:53979/] 
--name spark-pi --deploy-mode cluster --packages 
org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
spark.kubernetes.container.image=fifoosab/pytest:3.0.0.dev0 --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
spark.kubernetes.executor.request.cores=1 --conf 
spark.kubernetes.driver.request.cores=1 --conf 
spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py

 

{{full logs on the error in the attachments}}

  was:
Hello,

{{I have been trying to run a pyspark script on Spark on Kubernetes and I have 
this error that crashed the application:}}

{{java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD)}}

 

I followed those steps:
 * for spark on kubernetes: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that include 
building the image using docker-image-tool.sh on mac with -p flag)
 * Tried to use the image by the dev on 
GoogleCloudPlatform/spark-on-k8s-operator 
(gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
 * for kafka streaming: 
[https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
 * {{When running the script manually in a jupyter notebook 
(jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 
pyspark-shell) it ran without issue}}
 * the command ran from the laptop is:

spark-submit --master k8s://https://127.0.0.1:53979 --name spark-pi 
--deploy-mode cluster --packages 
org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
spark.kubernetes.container.image=fifoosab/pytest:3.0.0.dev0 --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
spark.kubernetes.executor.request.cores=1 --conf 
spark.kubernetes.driver.request.cores=1 --conf 
spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py

 

{{more logs on the error:}}
 \{{}}

{{20/07/23 14:26:08 INFO TaskSetManager: Lost task 1.3 in stage 1.0 (TID 11) on 
10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 11]}}
 {{20/07/23 14:26:08 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 times; 
aborting job}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Cancelling stage 1}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Killing all running tasks in stage 
1: Stage cancelled}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Stage 1 was cancelled}}
 {{20/07/23 14:26:08 INFO TaskSetManager: Lost task 3.3 in stage 1.0 (TID 13) 
on 10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance 
of java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 12]}}
 {{20/07/23 14:26:08 INFO DAGScheduler: ResultStage 1 (start at 
NativeMethodAccessorImpl.java:0) failed in 20.352 s due to Job aborted due to 
stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost 
task 1.3 in stage 1.0 (TID 11, 10.244.3.7, executor 1): 
java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD}}
 \{{ at 

[jira] [Updated] (SPARK-32414) pyspark crashes in cluster mode with kafka structured streaming

2020-07-23 Thread cyrille cazenave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cyrille cazenave updated SPARK-32414:
-
Attachment: fulllogs.txt

> pyspark crashes in cluster mode with kafka structured streaming
> ---
>
> Key: SPARK-32414
> URL: https://issues.apache.org/jira/browse/SPARK-32414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: * spark version 3.0.0 from mac brew
>  * kubernetes Kind 18+
>  * kafka cluster: strimzi/kafka:0.18.0-kafka-2.5.0
>  * kafka package: org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0
>Reporter: cyrille cazenave
>Priority: Major
> Attachments: fulllogs.txt
>
>
> Hello,
> {{I have been trying to run a pyspark script on Spark on Kubernetes and I 
> have this error that crashed the application:}}
> {{java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD)}}
>  
> I followed those steps:
>  * for spark on kubernetes: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that 
> include building the image using docker-image-tool.sh on mac with -p flag)
>  * Tried to use the image by the dev on 
> GoogleCloudPlatform/spark-on-k8s-operator 
> (gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
>  * for kafka streaming: 
> [https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
>  * {{When running the script manually in a jupyter notebook 
> (jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
> PYSPARK_SUBMIT_ARGS=--packages 
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell) it ran 
> without issue}}
>  * the command ran from the laptop is:
> spark-submit --master k8s://https://127.0.0.1:53979 --name spark-pi 
> --deploy-mode cluster --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
> spark.kubernetes.container.image=fifoosab/pytest:3.0.0.dev0 --conf 
> spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
> spark.kubernetes.executor.request.cores=1 --conf 
> spark.kubernetes.driver.request.cores=1 --conf 
> spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py
>  
> {{more logs on the error:}}
>  \{{}}
> {{20/07/23 14:26:08 INFO TaskSetManager: Lost task 1.3 in stage 1.0 (TID 11) 
> on 10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign 
> instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 11]}}
>  {{20/07/23 14:26:08 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 
> times; aborting job}}
>  {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Cancelling stage 1}}
>  {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Killing all running tasks in 
> stage 1: Stage cancelled}}
>  {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Stage 1 was cancelled}}
>  {{20/07/23 14:26:08 INFO TaskSetManager: Lost task 3.3 in stage 1.0 (TID 13) 
> on 10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign 
> instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD) [duplicate 12]}}
>  {{20/07/23 14:26:08 INFO DAGScheduler: ResultStage 1 (start at 
> NativeMethodAccessorImpl.java:0) failed in 20.352 s due to Job aborted due to 
> stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost 
> task 1.3 in stage 1.0 (TID 11, 10.244.3.7, executor 1): 
> java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD}}
>  \{{ at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
>  \{{ at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
>  \{{ at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2350)}}
>  \{{ at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
>  \{{ at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
>  \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
>  \{{ at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2344)}}
>  \{{ at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
>  \{{ at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
>  \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
>  \{{ 

[jira] [Updated] (SPARK-32414) pyspark crashes in cluster mode with kafka structured streaming

2020-07-23 Thread cyrille cazenave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cyrille cazenave updated SPARK-32414:
-
Description: 
Hello,

{{I have been trying to run a pyspark script on Spark on Kubernetes and I have 
this error that crashed the application:}}

{{java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD)}}

 

I followed those steps:
 * for spark on kubernetes: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that include 
building the image using docker-image-tool.sh on mac with -p flag)
 * Tried to use the image by the dev on 
GoogleCloudPlatform/spark-on-k8s-operator 
(gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
 * for kafka streaming: 
[https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
 * {{When running the script manually in a jupyter notebook 
(jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 
pyspark-shell) it ran without issue}}
 * the command ran from the laptop is:

spark-submit --master k8s://https://127.0.0.1:53979 --name spark-pi 
--deploy-mode cluster --packages 
org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
spark.kubernetes.container.image=fifoosab/pytest:3.0.0.dev0 --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
spark.kubernetes.executor.request.cores=1 --conf 
spark.kubernetes.driver.request.cores=1 --conf 
spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py

 

{{more logs on the error:}}
 \{{}}

{{20/07/23 14:26:08 INFO TaskSetManager: Lost task 1.3 in stage 1.0 (TID 11) on 
10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 11]}}
 {{20/07/23 14:26:08 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 times; 
aborting job}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Cancelling stage 1}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Killing all running tasks in stage 
1: Stage cancelled}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Stage 1 was cancelled}}
 {{20/07/23 14:26:08 INFO TaskSetManager: Lost task 3.3 in stage 1.0 (TID 13) 
on 10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance 
of java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 12]}}
 {{20/07/23 14:26:08 INFO DAGScheduler: ResultStage 1 (start at 
NativeMethodAccessorImpl.java:0) failed in 20.352 s due to Job aborted due to 
stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost 
task 1.3 in stage 1.0 (TID 11, 10.244.3.7, executor 1): 
java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD}}
 \{{ at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
 \{{ at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
 \{{ at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2350)}}
 \{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
 \{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
 \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
 \{{ at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2344)}}
 \{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
 \{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
 \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
 \{{ at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)}}
 \{{ at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)}}
 \{{ at 
scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)}}
 \{{ at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)}}
 \{{ at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
 \{{ at java.lang.reflect.Method.invoke(Method.java:498)}}
 \{{ at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)}}
 \{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2235)}}
 \{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
 \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
 \{{ at 

[jira] [Commented] (SPARK-30648) Support filters pushdown in JSON datasource

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163690#comment-17163690
 ] 

Apache Spark commented on SPARK-30648:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29206

> Support filters pushdown in JSON datasource
> ---
>
> Key: SPARK-30648
> URL: https://issues.apache.org/jira/browse/SPARK-30648
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> * Implement the `SupportsPushDownFilters` interface in `JsonScanBuilder`
>  * Apply filters in JacksonParser
>  * Change the JacksonParser API to return Option[InternalRow] from 
> `convertObject()` for root JSON fields.
>  * Update JSONBenchmark
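
A minimal sketch (hypothetical input path and schema) of how the change is 
expected to surface to users: a predicate pushed into the JSON scan can be 
checked while each record is parsed, instead of only after full rows are built.

{code:java}
import spark.implicits._

val people = spark.read
  .schema("id INT, name STRING")
  .json("/tmp/people.json")  // hypothetical input path

people.where($"id" > 1).explain()
// With pushdown in place, the predicate should appear as part of the JSON scan
// (e.g. under PushedFilters) rather than only as a separate Filter node.
{code}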



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32414) pyspark crashes in cluster mode with kafka structured streaming

2020-07-23 Thread cyrille cazenave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cyrille cazenave updated SPARK-32414:
-
Description: 
Hello,

{{I have been trying to run a pyspark script on Spark on Kubernetes and I have 
this error that crashed the application:}}

{{java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD)}}

 

I followed those steps:
 * for spark on kubernetes: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that include 
building the image using docker-image-tool.sh on mac with -p flag)
 * Tried to use the image by the dev on 
GoogleCloudPlatform/spark-on-k8s-operator 
(gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
 * for kafka streaming: 
[https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
 * {{When running the script manually in a jupyter notebook 
(jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 
pyspark-shell) it ran without issue}}
 * the command ran from the laptop is:

{{spark-submit --master 
k8s://[https://127.0.0.1:53979|https://127.0.0.1:53979/] --name spark-pi 
--deploy-mode cluster --packages 
org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
spark.kubernetes.container.image=fifoosab/pytest:latest --conf 
spark.jars.ivy=/tmp --conf 
spark.kubernetes.driver.volumes.emptyDir.ivy.mount.path=/opt/spark/ivy --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py}}

 

{{more logs on the error:}}
 \{{}}

{{20/07/23 14:26:08 INFO TaskSetManager: Lost task 1.3 in stage 1.0 (TID 11) on 
10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 11]}}
 {{20/07/23 14:26:08 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 times; 
aborting job}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Cancelling stage 1}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Killing all running tasks in stage 
1: Stage cancelled}}
 {{20/07/23 14:26:08 INFO TaskSchedulerImpl: Stage 1 was cancelled}}
 {{20/07/23 14:26:08 INFO TaskSetManager: Lost task 3.3 in stage 1.0 (TID 13) 
on 10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance 
of java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 12]}}
 {{20/07/23 14:26:08 INFO DAGScheduler: ResultStage 1 (start at 
NativeMethodAccessorImpl.java:0) failed in 20.352 s due to Job aborted due to 
stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost 
task 1.3 in stage 1.0 (TID 11, 10.244.3.7, executor 1): 
java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD}}
 \{{ at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
 \{{ at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
 \{{ at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2350)}}
 \{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
 \{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
 \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
 \{{ at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2344)}}
 \{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
 \{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
 \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
 \{{ at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)}}
 \{{ at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)}}
 \{{ at 
scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)}}
 \{{ at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)}}
 \{{ at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
 \{{ at java.lang.reflect.Method.invoke(Method.java:498)}}
 \{{ at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)}}
 \{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2235)}}
 \{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
 \{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
 \{{ at 

[jira] [Created] (SPARK-32414) pyspark crashes in cluster mode with kafka structured streaming

2020-07-23 Thread cyrille cazenave (Jira)
cyrille cazenave created SPARK-32414:


 Summary: pyspark crashes in cluster mode with kafka structured 
streaming
 Key: SPARK-32414
 URL: https://issues.apache.org/jira/browse/SPARK-32414
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
 Environment: * spark version 3.0.0 from mac brew
 * kubernetes Kind 18+
 * kafka cluster: strimzi/kafka:0.18.0-kafka-2.5.0
 * kafka package: org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0
Reporter: cyrille cazenave


{{Hello, }}

{{I have been trying to run a pyspark script on Spark on Kubernetes and I have 
this error that crashed the application:}}

{{java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD)}}

 

I followed those steps:
 * for spark on kubernetes: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] (that include 
building the image using docker-image-tool.sh on mac with -p flag)
 * Tried to use the image by the dev on 
GoogleCloudPlatform/spark-on-k8s-operator 
(gcr.io/spark-operator/spark-py:v3.0.0) and have the same issue
 * for kafka streaming: 
[https://spark.apache.org/docs/3.0.0/structured-streaming-kafka-integration.html#deploying]
 * {{When running the script manually in a jupyter notebook 
(jupyter/pyspark-notebook:latest, version 3.0.0) in local mode (with 
PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 
pyspark-shell) it ran without issue}}
 * the command ran from the laptop is:

{{spark-submit --master k8s://https://127.0.0.1:53979 --name spark-pi 
--deploy-mode cluster --packages 
org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --conf 
spark.kubernetes.container.image=fifoosab/pytest:latest --conf 
spark.jars.ivy=/tmp --conf 
spark.kubernetes.driver.volumes.emptyDir.ivy.mount.path=/opt/spark/ivy --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
spark.kubernetes.container.image.pullPolicy=Always local:///usr/bin/spark.py}}

 

{{more logs on the error:}}
{{}}

{{20/07/23 14:26:08 INFO TaskSetManager: Lost task 1.3 in stage 1.0 (TID 11) on 
10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 11]}}
{{20/07/23 14:26:08 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 times; 
aborting job}}
{{20/07/23 14:26:08 INFO TaskSchedulerImpl: Cancelling stage 1}}
{{20/07/23 14:26:08 INFO TaskSchedulerImpl: Killing all running tasks in stage 
1: Stage cancelled}}
{{20/07/23 14:26:08 INFO TaskSchedulerImpl: Stage 1 was cancelled}}
{{20/07/23 14:26:08 INFO TaskSetManager: Lost task 3.3 in stage 1.0 (TID 13) on 
10.244.3.7, executor 1: java.lang.ClassCastException (cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD) [duplicate 12]}}
{{20/07/23 14:26:08 INFO DAGScheduler: ResultStage 1 (start at 
NativeMethodAccessorImpl.java:0) failed in 20.352 s due to Job aborted due to 
stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost 
task 1.3 in stage 1.0 (TID 11, 10.244.3.7, executor 1): 
java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD}}
{{ at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{ at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2350)}}
{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
{{ at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2344)}}
{{ at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)}}
{{ at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)}}
{{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)}}
{{ at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)}}
{{ at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)}}
{{ at 
scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)}}
{{ at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)}}
{{ at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
{{ at java.lang.reflect.Method.invoke(Method.java:498)}}

[jira] [Resolved] (SPARK-32386) Fix temp view leaking in Structured Streaming tests

2020-07-23 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li resolved SPARK-32386.
-
Resolution: Won't Fix

In various suites, we need this temp view for checking results, e.g. 
KafkaDontFailOnDataLossSuite.

> Fix temp view leaking in Structured Streaming tests
> ---
>
> Key: SPARK-32386
> URL: https://issues.apache.org/jira/browse/SPARK-32386
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32372) "Resolved attribute(s) XXX missing" after dedup conflict references

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32372.
-
Fix Version/s: 3.1.0
   3.0.1
 Assignee: wuyi
   Resolution: Fixed

> "Resolved attribute(s) XXX missing" after dudup conflict references
> ---
>
> Key: SPARK-32372
> URL: https://issues.apache.org/jira/browse/SPARK-32372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.4, 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> // case class Person(id: Int, name: String, age: Int)
> sql("SELECT name, avg(age) as avg_age FROM person GROUP BY 
> name").createOrReplaceTempView("person_a")
> sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = 
> p2.name").createOrReplaceTempView("person_b")
> sql("SELECT * FROM person_a UNION SELECT * FROM person_b")   
> .createOrReplaceTempView("person_c")
> sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name 
> = p2.name").show
> {code}
> error:
> {code:java}
> [info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: 
> Resolved attribute(s) avg_age#235 missing from name#233,avg_age#231 in 
> operator !Project [name#233, avg_age#235]. Attribute(s) with the same name 
> appear in the operation: avg_age. Please check if the right attribute(s) are 
> used.;;
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32374) Disallow setting properties when creating temporary views

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32374:
---

Assignee: Terry Kim  (was: Apache Spark)

> Disallow setting properties when creating temporary views
> -
>
> Key: SPARK-32374
> URL: https://issues.apache.org/jira/browse/SPARK-32374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, you can specify properties when creating a temporary view. 
> However, they are not used and SHOW TBLPROPERTIES always returns an empty 
> result on temporary views.
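
A minimal sketch (hypothetical view name, run via spark.sql) of the behavior 
described above:

{code:java}
spark.sql("CREATE TEMPORARY VIEW tmp_v TBLPROPERTIES ('p'='v') AS SELECT 1 AS id")
spark.sql("SHOW TBLPROPERTIES tmp_v").show()
// Today the properties are silently ignored and SHOW TBLPROPERTIES returns an
// empty result; with this change the CREATE statement itself is rejected.
{code}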



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32372) "Resolved attribute(s) XXX missing" after dedup conflict references

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32372:

Priority: Major  (was: Critical)

> "Resolved attribute(s) XXX missing" after dudup conflict references
> ---
>
> Key: SPARK-32372
> URL: https://issues.apache.org/jira/browse/SPARK-32372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.4, 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> // case class Person(id: Int, name: String, age: Int)
> sql("SELECT name, avg(age) as avg_age FROM person GROUP BY 
> name").createOrReplaceTempView("person_a")
> sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = 
> p2.name").createOrReplaceTempView("person_b")
> sql("SELECT * FROM person_a UNION SELECT * FROM person_b")   
> .createOrReplaceTempView("person_c")
> sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name 
> = p2.name").show
> {code}
> error:
> {code:java}
> [info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: 
> Resolved attribute(s) avg_age#235 missing from name#233,avg_age#231 in 
> operator !Project [name#233, avg_age#235]. Attribute(s) with the same name 
> appear in the operation: avg_age. Please check if the right attribute(s) are 
> used.;;
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32372) "Resolved attribute(s) XXX missing" after dedup conflict references

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32372:

Priority: Critical  (was: Blocker)

> "Resolved attribute(s) XXX missing" after dudup conflict references
> ---
>
> Key: SPARK-32372
> URL: https://issues.apache.org/jira/browse/SPARK-32372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.4, 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Critical
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> // case class Person(id: Int, name: String, age: Int)
> sql("SELECT name, avg(age) as avg_age FROM person GROUP BY 
> name").createOrReplaceTempView("person_a")
> sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = 
> p2.name").createOrReplaceTempView("person_b")
> sql("SELECT * FROM person_a UNION SELECT * FROM person_b")   
> .createOrReplaceTempView("person_c")
> sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name 
> = p2.name").show
> {code}
> error:
> {code:java}
> [info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: 
> Resolved attribute(s) avg_age#235 missing from name#233,avg_age#231 in 
> operator !Project [name#233, avg_age#235]. Attribute(s) with the same name 
> appear in the operation: avg_age. Please check if the right attribute(s) are 
> used.;;
> ...{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32374) Disallow setting properties when creating temporary views

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32374.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29167
[https://github.com/apache/spark/pull/29167]

> Disallow setting properties when creating temporary views
> -
>
> Key: SPARK-32374
> URL: https://issues.apache.org/jira/browse/SPARK-32374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, you can specify properties when creating a temporary view. 
> However, they are not used and SHOW TBLPROPERTIES always returns an empty 
> result on temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32280) AnalysisException thrown when query contains several JOINs

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32280:
---

Assignee: wuyi

> AnalysisException thrown when query contains several JOINs
> --
>
> Key: SPARK-32280
> URL: https://issues.apache.org/jira/browse/SPARK-32280
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: David Lindelöf
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> I've come across a curious {{AnalysisException}} thrown in one of my SQL 
> queries, even though the SQL appears legitimate. I was able to reduce it to 
> this example:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> spark.sql('SELECT 1 AS id').createOrReplaceTempView('A')
> spark.sql('''
>  SELECT id,
>  'foo' AS kind
>  FROM A''').createOrReplaceTempView('B')
> spark.sql('''
>  SELECT l.id
>  FROM B AS l
>  JOIN B AS r
>  ON l.kind = r.kind''').createOrReplaceTempView('C')
> spark.sql('''
>  SELECT 0
>  FROM (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  JOIN (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  USING (id)''')
> {code}
> Running this yields the following error:
> {code}
>  py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
> : org.apache.spark.sql.AnalysisException: Resolved attribute(s) kind#11 
> missing from id#10,kind#2,id#7,kind#5 in operator !Join Inner, (kind#11 = 
> kind#5). Attribute(s) with the same name appear in the operation: kind. 
> Please check if the right attribute(s) are used.;;
> Project [0 AS 0#15]
> +- Project [id#0, kind#2, kind#11]
>+- Join Inner, (id#0 = id#14)
>   :- SubqueryAlias `__auto_generated_subquery_name`
>   :  +- Project [id#0, kind#2]
>   : +- Project [id#0, kind#2]
>   :+- Join Inner, (id#0 = id#9)
>   :   :- SubqueryAlias `b`
>   :   :  +- Project [id#0, foo AS kind#2]
>   :   : +- SubqueryAlias `a`
>   :   :+- Project [1 AS id#0]
>   :   :   +- OneRowRelation
>   :   +- SubqueryAlias `c`
>   :  +- Project [id#9]
>   : +- Join Inner, (kind#2 = kind#5)
>   ::- SubqueryAlias `l`
>   ::  +- SubqueryAlias `b`
>   :: +- Project [id#9, foo AS kind#2]
>   ::+- SubqueryAlias `a`
>   ::   +- Project [1 AS id#9]
>   ::  +- OneRowRelation
>   :+- SubqueryAlias `r`
>   :   +- SubqueryAlias `b`
>   :  +- Project [id#7, foo AS kind#5]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#7]
>   :   +- OneRowRelation
>   +- SubqueryAlias `__auto_generated_subquery_name`
>  +- Project [id#14, kind#11]
> +- Project [id#14, kind#11]
>+- Join Inner, (id#14 = id#10)
>   :- SubqueryAlias `b`
>   :  +- Project [id#14, foo AS kind#11]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#14]
>   :   +- OneRowRelation
>   +- SubqueryAlias `c`
>  +- Project [id#10]
> +- !Join Inner, (kind#11 = kind#5)
>:- SubqueryAlias `l`
>:  +- SubqueryAlias `b`
>: +- Project [id#10, foo AS kind#2]
>:+- SubqueryAlias `a`
>:   +- Project [1 AS id#10]
>:  +- OneRowRelation
>+- SubqueryAlias `r`
>   +- SubqueryAlias `b`
>  +- Project [id#7, foo AS kind#5]
> +- SubqueryAlias `a`
>+- Project [1 AS id#7]
>   +- OneRowRelation
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:369)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
>   at 
> 

[jira] [Resolved] (SPARK-32280) AnalysisException thrown when query contains several JOINs

2020-07-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32280.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29166
[https://github.com/apache/spark/pull/29166]

> AnalysisException thrown when query contains several JOINs
> --
>
> Key: SPARK-32280
> URL: https://issues.apache.org/jira/browse/SPARK-32280
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: David Lindelöf
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> I've come across a curious {{AnalysisException}} thrown in one of my SQL 
> queries, even though the SQL appears legitimate. I was able to reduce it to 
> this example:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> spark.sql('SELECT 1 AS id').createOrReplaceTempView('A')
> spark.sql('''
>  SELECT id,
>  'foo' AS kind
>  FROM A''').createOrReplaceTempView('B')
> spark.sql('''
>  SELECT l.id
>  FROM B AS l
>  JOIN B AS r
>  ON l.kind = r.kind''').createOrReplaceTempView('C')
> spark.sql('''
>  SELECT 0
>  FROM (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  JOIN (
>SELECT *
>FROM B
>JOIN C
>USING (id))
>  USING (id)''')
> {code}
> Running this yields the following error:
> {code}
>  py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
> : org.apache.spark.sql.AnalysisException: Resolved attribute(s) kind#11 
> missing from id#10,kind#2,id#7,kind#5 in operator !Join Inner, (kind#11 = 
> kind#5). Attribute(s) with the same name appear in the operation: kind. 
> Please check if the right attribute(s) are used.;;
> Project [0 AS 0#15]
> +- Project [id#0, kind#2, kind#11]
>+- Join Inner, (id#0 = id#14)
>   :- SubqueryAlias `__auto_generated_subquery_name`
>   :  +- Project [id#0, kind#2]
>   : +- Project [id#0, kind#2]
>   :+- Join Inner, (id#0 = id#9)
>   :   :- SubqueryAlias `b`
>   :   :  +- Project [id#0, foo AS kind#2]
>   :   : +- SubqueryAlias `a`
>   :   :+- Project [1 AS id#0]
>   :   :   +- OneRowRelation
>   :   +- SubqueryAlias `c`
>   :  +- Project [id#9]
>   : +- Join Inner, (kind#2 = kind#5)
>   ::- SubqueryAlias `l`
>   ::  +- SubqueryAlias `b`
>   :: +- Project [id#9, foo AS kind#2]
>   ::+- SubqueryAlias `a`
>   ::   +- Project [1 AS id#9]
>   ::  +- OneRowRelation
>   :+- SubqueryAlias `r`
>   :   +- SubqueryAlias `b`
>   :  +- Project [id#7, foo AS kind#5]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#7]
>   :   +- OneRowRelation
>   +- SubqueryAlias `__auto_generated_subquery_name`
>  +- Project [id#14, kind#11]
> +- Project [id#14, kind#11]
>+- Join Inner, (id#14 = id#10)
>   :- SubqueryAlias `b`
>   :  +- Project [id#14, foo AS kind#11]
>   : +- SubqueryAlias `a`
>   :+- Project [1 AS id#14]
>   :   +- OneRowRelation
>   +- SubqueryAlias `c`
>  +- Project [id#10]
> +- !Join Inner, (kind#11 = kind#5)
>:- SubqueryAlias `l`
>:  +- SubqueryAlias `b`
>: +- Project [id#10, foo AS kind#2]
>:+- SubqueryAlias `a`
>:   +- Project [1 AS id#10]
>:  +- OneRowRelation
>+- SubqueryAlias `r`
>   +- SubqueryAlias `b`
>  +- Project [id#7, foo AS kind#5]
> +- SubqueryAlias `a`
>+- Project [1 AS id#7]
>   +- OneRowRelation
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:369)
>   at 
> 

[jira] [Commented] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163595#comment-17163595
 ] 

Apache Spark commented on SPARK-32408:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29205

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.
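
An editor-added sketch, not the actual patch: one way the SBT build definition could flip crossPaths only when running under GitHub Actions, which exports the GITHUB_ACTIONS environment variable:

{code:scala}
// build.sbt fragment (illustrative only): keep the scala-<version> output
// directories everywhere except GitHub Actions runs.
crossPaths := !sys.env.contains("GITHUB_ACTIONS")
{code}

Local SBT builds would then keep the usual target/scala-<version> layout, so scripts and tools that grep for that path keep working.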



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32408:


Assignee: (was: Apache Spark)

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32408:


Assignee: Apache Spark

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32334) Investigate commonizing Columnar and Row data transformations

2020-07-23 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163576#comment-17163576
 ] 

Robert Joseph Evans commented on SPARK-32334:
-

Row to columnar and columnar to row are mostly figured out. There are some 
performance improvements that we could probably make in the row-to-columnar 
transition. The harder part is going to be the columnar-to-columnar transition. 
Copying data from one columnar format to another in a performant way is a 
solvable problem, but we might need to special-case a few things or do code 
generation if we cannot come up with a good common API. The real issue is going 
to be the desired batch size.

Parquet and ORC output a batch size of 4096 rows by default, but each has a 
separate config.
In-memory columnar storage wants 10,000 rows by default, but also has a 
hard-coded soft limit of 4 MB compressed.
The Arrow config, though, is for a maximum size of 10,000 rows by default.
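
For reference, an editor-added sketch of the batch-size knobs being compared above, assuming an active SparkSession; the values shown are the documented defaults as I understand them:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Parquet and ORC vectorized readers each have their own batch-size config.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 4096L)
spark.conf.set("spark.sql.orc.columnarReaderBatchSize", 4096L)

// In-memory columnar cache: rows per cached batch.
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", 10000L)

// Arrow-based transfers: maximum records per batch.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 10000L)
{code}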

So I am thinking that we want `SparkPlan` to optionally specify a maximum batch 
size instead of a target size. The row-to-columnar transition would just build 
up a batch until it hits the target size or the end of the input iterator. The 
columnar-to-columnar transition is a little more complicated: it would have to 
copy out a range of rows from one batch into another batch. This could mean, in 
the worst case, that one batch comes in in Arrow format but we need to copy it 
to another batch so that we can split it up to the target size.
This should cover the use case for basic map like UDFs.

For UDFs like `FlatMapCoGroupsInPandasExec` there is no fixed batch size; in 
fact, it takes two iterators as input that are co-grouped together. If we wanted 
an operator like this to do columnar processing, we would have to be able to 
replicate all of that processing, but for columnar Arrow-formatted data. This is 
starting to go beyond what I see as the scope of this JIRA, and I would prefer 
to stick with just `MapInPandasExec`, `MapPartitionsInRWithArrowExec`, and 
`ArrowEvalPythonExec` for now. In follow-on work we can start to look at what it 
would take to support an ArrowBatchedGroupedIterator and an 
ArrowBatchedCoGroupedIterator.

> Investigate commonizing Columnar and Row data transformations 
> --
>
> Key: SPARK-32334
> URL: https://issues.apache.org/jira/browse/SPARK-32334
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We introduced more Columnar Support with SPARK-27396.
> With that we recognized that there is code that is doing very similar 
> transformations from ColumnarBatch or Arrow into InternalRow and vice versa.  
> For instance: 
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L56-L58]
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L389]
> We should investigate if we can commonize that code.
> We are also looking at making the internal caching serialization pluggable to 
> allow for different cache implementations. 
> ([https://github.com/apache/spark/pull/29067]). 
> It was recently brought up that we should investigate if using the data 
> source v2 api makes sense and is feasible for some of these transformations 
> to allow it to be easily extended.
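
As a concrete, editor-added illustration (not from the ticket) of the columnar-to-row direction being discussed, using the public ColumnarBatch API:

{code:scala}
import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Minimal sketch: expose a ColumnarBatch as an iterator of InternalRow,
// roughly the kind of conversion the linked Columnar.scala code performs
// (the real ColumnarToRowExec generates specialized code instead).
def batchToRows(batch: ColumnarBatch): Iterator[InternalRow] =
  batch.rowIterator().asScala
{code}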



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32364) Use CaseInsensitiveMap for DataFrameReader/Writer options

2020-07-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32364:
--
Reporter: Girish A Pandit  (was: Dongjoon Hyun)

> Use CaseInsensitiveMap for DataFrameReader/Writer options
> -
>
> Key: SPARK-32364
> URL: https://issues.apache.org/jira/browse/SPARK-32364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Girish A Pandit
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> When a user has multiple options like path, paTH, and PATH for the same key 
> path, option/options is non-deterministic because extraOptions is a HashMap. 
> This issue aims to use *CaseInsensitiveMap* instead of *HashMap* to fix this 
> bug fundamentally.
> {code}
> spark.read
>   .option("paTh", "1")
>   .option("PATH", "2")
>   .option("Path", "3")
>   .option("patH", "4")
>   .load("5")
> ...
> org.apache.spark.sql.AnalysisException:
> Path does not exist: file:/.../1;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32412) Unify error handling for spark thrift server operations

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32412:


Assignee: Apache Spark

> Unify error handling for spark thrift server operations
> ---
>
> Key: SPARK-32412
> URL: https://issues.apache.org/jira/browse/SPARK-32412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> Log errors only once on the server side for all kinds of operations, in both 
> async and sync mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32412) Unify error handling for spark thrift server operations

2020-07-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32412:


Assignee: (was: Apache Spark)

> Unify error handling for spark thrift server operations
> ---
>
> Key: SPARK-32412
> URL: https://issues.apache.org/jira/browse/SPARK-32412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Log errors only once on the server side for all kinds of operations, in both 
> async and sync mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32412) Unify error handling for spark thrift server operations

2020-07-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163539#comment-17163539
 ] 

Apache Spark commented on SPARK-32412:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29204

> Unify error handling for spark thrift server operations
> ---
>
> Key: SPARK-32412
> URL: https://issues.apache.org/jira/browse/SPARK-32412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Log errors only once on the server side for all kinds of operations, in both 
> async and sync mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32413) Guidance for my project

2020-07-23 Thread Suat Toksoz (Jira)
Suat Toksoz created SPARK-32413:
---

 Summary: Guidance for my project 
 Key: SPARK-32413
 URL: https://issues.apache.org/jira/browse/SPARK-32413
 Project: Spark
  Issue Type: Brainstorming
  Components: PySpark, Spark Core, SparkR
Affects Versions: 3.0.0
Reporter: Suat Toksoz


Hi,

I am planning to read an Elasticsearch index continuously, put that data into a 
DataFrame, group the data, run searches, and create alerts. I would like to 
write my code in Python.

For this purpose, what should I use: Spark, Jupyter, PySpark...?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32412) Unify error handling for spark thrift server operations

2020-07-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-32412:


 Summary: Unify error handling for spark thrift server operations
 Key: SPARK-32412
 URL: https://issues.apache.org/jira/browse/SPARK-32412
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


Log errors only once on the server side for all kinds of operations, in both 
async and sync mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32411) GPU Cluster Fail

2020-07-23 Thread Vinh Tran (Jira)
Vinh Tran created SPARK-32411:
-

 Summary: GPU Cluster Fail
 Key: SPARK-32411
 URL: https://issues.apache.org/jira/browse/SPARK-32411
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Web UI
Affects Versions: 3.0.0
 Environment: I have an Apache Spark 3.0 cluster consisting of machines 
with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster 
using PySpark.
Reporter: Vinh Tran


I'm having a difficult time getting a GPU cluster started on Apache Spark 3.0. 
It was hard to find documentation on this, but I stumbled on an NVIDIA GitHub 
page for RAPIDS, which suggested the following additional edits to 
spark-defaults.conf:
{code:java}
spark.task.resource.gpu.amount 0.25
spark.executor.resource.gpu.discoveryScript 
./usr/local/spark/getGpusResources.sh{code}
I have an Apache Spark 3.0 cluster consisting of machines with multiple 
NVIDIA GPUs, and I connect my Jupyter notebook to the cluster using PySpark; 
however, it results in the following error: 
{code:java}
Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: You must specify an amount for gpu
at 
org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
at 
org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
at 
org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
at 
org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
at 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
at org.apache.spark.SparkContext.(SparkContext.scala:528)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
{code}
After this, I then tried adding another line to the conf per the instructions, 
which results in no errors; however, when I log in to the Web UI at 
localhost:8080, under Running Applications, the state remains at "waiting".
{code:java}
spark.task.resource.gpu.amount  2
spark.executor.resource.gpu.discoveryScript
./usr/local/spark/getGpusResources.sh
spark.executor.resource.gpu.amount  1
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Parent: SPARK-32244
Issue Type: Sub-task  (was: Improvement)

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep -r "target/scala-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Description: 
After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
tests per project properly.
This is a correct change since we're not doing the cross build in SBT.

Now, the intermediate classes are placed without a Scala version directory in 
the SBT build specifically.
It seems to cause side effects that depend on that path. See, for 
example, {{git grep \-r "target/scala\-"}}.

To minimise the side effects, we should disable crossPaths only in the GitHub 
Actions build for now.

  was:
After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
tests per project properly.
This is a correct change since we're not doing the cross build in SBT.

Now, the intermediate classes are placed without a Scala version directory in 
the SBT build specifically.
It seems to cause side effects that depend on that path. See, for 
example, {{git grep -r "target/scala-"}}.

To minimise the side effects, we should disable crossPaths only in the GitHub 
Actions build for now.


> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep \-r "target/scala\-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Issue Type: Test  (was: Improvement)

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep -r "target/scala-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Issue Type: Improvement  (was: Test)

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep -r "target/scala-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Description: 
After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
tests per project properly.
This is a correct change since we're not doing the cross build in SBT.

Now, the intermediate classes are placed without a Scala version directory in 
the SBT build specifically.
It seems to cause side effects that depend on that path. See, for 
example, {{git grep -r "target/scala-"}}.

To minimise the side effects, we should disable crossPaths only in the GitHub 
Actions build for now.

  was:
After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
tests per project properly.
This is a correct change.

Now, the intermediate classes are placed without a Scala version directory in 
the SBT build specifically.

We should reflect these changes, in particular regarding classpaths.

SBT assembly does not get affected, so it is mostly just test-only.


> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change since we're not doing the cross build in SBT.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> It seems to cause side effects that depend on that path. See, for 
> example, {{git grep -r "target/scala-"}}.
> To minimise the side effects, we should disable crossPaths only in the GitHub 
> Actions build for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Description: 
After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
tests per project properly.
This is a correct change.

Now, the intermediate classes are placed without a Scala version directory in 
the SBT build specifically.

We should reflect these changes, in particular regarding classpaths.

SBT assembly does not get affected, so it is mostly just test-only.

  was:
After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
tests per project properly.

Now, the intermediate classes are placed without a Scala version directory in 
the SBT build specifically.

We should reflect these changes, in particular regarding classpaths.

SBT assembly does not get affected, so it is mostly just test-only.


> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> This is a correct change.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> We should reflect these changes, in particular regarding classpaths.
> SBT assembly does not get affected, so it is mostly just test-only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32408) Disable crossPaths only in GitHub Actions to prevent side effects

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32408:
-
Summary: Disable crossPaths only in GitHub Actions to prevent side effects  
(was: Reflect the removed Scala version directory to the classpaths)

> Disable crossPaths only in GitHub Actions to prevent side effects
> -
>
> Key: SPARK-32408
> URL: https://issues.apache.org/jira/browse/SPARK-32408
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32245, crossPaths was disabled in the SBT build to run the JUnit 
> tests per project properly.
> Now, the intermediate classes are placed without a Scala version directory in 
> the SBT build specifically.
> We should reflect these changes, in particular regarding classpaths.
> SBT assembly does not get affected, so it is mostly just test-only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32389) Add all hive.execution suites in the parallel test group

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32389.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28977
[https://github.com/apache/spark/pull/28977]

> Add all hive.execution suites in the parallel test group
> 
>
> Key: SPARK-32389
> URL: https://issues.apache.org/jira/browse/SPARK-32389
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.1.0
>
>
> Similar to SPARK-27460, we add an extra parallel test group for all 
> `hive.execution` suites to reduce the Jenkins testing time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32389) Add all hive.execution suites in the parallel test group

2020-07-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32389:


Assignee: Yuanjian Li

> Add all hive.execution suites in the parallel test group
> 
>
> Key: SPARK-32389
> URL: https://issues.apache.org/jira/browse/SPARK-32389
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> Similar to SPARK-27460, we add an extra parallel test group for all 
> `hive.execution` suites to reduce the Jenkins testing time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


