[jira] [Assigned] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32508:
---

Assignee: dzcxzl

> Disallow empty part col values in partition spec before static partition 
> writing
> 
>
> Key: SPARK-32508
> URL: https://issues.apache.org/jira/browse/SPARK-32508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
>
> When writing to a static partition whose partition column value is empty, the 
> error is only reported after all tasks have completed.
> We can reject such a partition spec before the tasks are submitted; a minimal 
> example of the offending statement follows the stack trace below.
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key d is null or empty;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276)
> {code}
>  
>  
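For context, a minimal sketch of the kind of statement that triggers the error above (hypothetical table and column names, assuming a Hive-backed table); the proposal is to reject the empty static partition value before any tasks are submitted:

{code}
// Hypothetical repro sketch run in spark-shell with Hive support enabled.
spark.sql("CREATE TABLE target (a INT) PARTITIONED BY (d STRING) STORED AS PARQUET")

// The static partition value for key 'd' is empty. With the proposed check this fails
// fast at submission time instead of surfacing the HiveException after all tasks finish.
spark.sql("INSERT OVERWRITE TABLE target PARTITION (d = '') SELECT 1")
{code}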






[jira] [Resolved] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32508.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29316
[https://github.com/apache/spark/pull/29316]

> Disallow empty part col values in partition spec before static partition 
> writing
> 
>
> Key: SPARK-32508
> URL: https://issues.apache.org/jira/browse/SPARK-32508
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.1.0
>
>
> When writing to a static partition whose partition column value is empty, the 
> error is only reported after all tasks have completed.
> We can reject such a partition spec before the tasks are submitted.
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key d is null or empty;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276)
> {code}
>  
>  






[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2020-09-16 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197406#comment-17197406
 ] 

Terry Kim commented on SPARK-29900:
---

[~laurikoobas] Can you share a repro example?

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. Try to look up the temp view first, then the table/persistent view.
> 2. Try to look up only the table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when a temp view and a table have 
> the same name, but users can easily use a qualified table name to 
> disambiguate.
> In Postgres, the relation resolution behavior is consistent:
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, the temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}
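For comparison, a hedged sketch of the Spark-side behavior described above (a hypothetical spark-shell session; which commands follow which behavior is exactly the inconsistency this ticket proposes to remove):

{code}
spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)")
spark.sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")

// Behavior 1: SELECT resolves the temp view first, so this returns 2 rather than 1.
spark.sql("SELECT * FROM t").show()

// Behavior 2: most other commands resolve only the table/persistent view, so the same
// single-part name "t" can refer to different relations depending on the command.
{code}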






[jira] [Commented] (SPARK-32189) Development - Setting up PyCharm

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197402#comment-17197402
 ] 

Apache Spark commented on SPARK-32189:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/29781

> Development - Setting up PyCharm
> 
>
> Key: SPARK-32189
> URL: https://issues.apache.org/jira/browse/SPARK-32189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
>
> PyCharm is probably one of the most widely used IDEs for Python development. 
> We should document a standard way so users can easily set it up.
> Very rough example: 
> https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark
>  but it should be more detailed.






[jira] [Assigned] (SPARK-32189) Development - Setting up PyCharm

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32189:


Assignee: Apache Spark  (was: Haejoon Lee)

> Development - Setting up PyCharm
> 
>
> Key: SPARK-32189
> URL: https://issues.apache.org/jira/browse/SPARK-32189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> PyCharm is probably one of the most widely used IDEs for Python development. 
> We should document a standard way so users can easily set it up.
> Very rough example: 
> https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark
>  but it should be more detailed.






[jira] [Assigned] (SPARK-32189) Development - Setting up PyCharm

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32189:


Assignee: Haejoon Lee  (was: Apache Spark)

> Development - Setting up PyCharm
> 
>
> Key: SPARK-32189
> URL: https://issues.apache.org/jira/browse/SPARK-32189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
>
> PyCharm is probably one of the most widely used IDEs for Python development. 
> We should document a standard way so users can easily set it up.
> Very rough example: 
> https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark
>  but it should be more detailed.






[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2020-09-16 Thread Lauri Koobas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197395#comment-17197395
 ] 

Lauri Koobas commented on SPARK-29900:
--

The problem actually arose from using `SQLContext.tableNames`, which does NOT 
filter out temporary views even if you pass it a database name.
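A possible repro sketch of this (a hypothetical session; whether the temp view appears in the result is exactly the behavior in question):

{code}
spark.sql("CREATE TABLE default.real_table (i INT) USING parquet")
spark.range(1).createOrReplaceTempView("tmp_view")

// An explicit database name is passed; the question is whether "tmp_view"
// still shows up alongside "real_table".
spark.sqlContext.tableNames("default").foreach(println)
{code}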

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. Try to look up the temp view first, then the table/persistent view.
> 2. Try to look up only the table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when a temp view and a table have 
> the same name, but users can easily use a qualified table name to 
> disambiguate.
> In Postgres, the relation resolution behavior is consistent:
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, the temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}






[jira] [Commented] (SPARK-32186) Development - Debugging

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197393#comment-17197393
 ] 

Apache Spark commented on SPARK-32186:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/29781

> Development - Debugging
> ---
>
> Key: SPARK-32186
> URL: https://issues.apache.org/jira/browse/SPARK-32186
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> 1. Python Profiler: 
> https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/profiler.html,
>  
> {code}
> >>> sc._conf.set("spark.python.profile", "true")
> >>> rdd = sc.parallelize(range(100)).map(str)
> >>> rdd.count()
> 100
> >>> sc.show_profiles()
> {code}
> 2. Python Debugger, e.g. 
> https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter
>  \(?\)
> 3. Monitoring Python Workers: {{top}}, {{ls}}, etc. \(?\)
> 4. PyCharm setup \(?\) (at https://issues.apache.org/jira/browse/SPARK-32189)






[jira] [Commented] (SPARK-32186) Development - Debugging

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197391#comment-17197391
 ] 

Apache Spark commented on SPARK-32186:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/29781

> Development - Debugging
> ---
>
> Key: SPARK-32186
> URL: https://issues.apache.org/jira/browse/SPARK-32186
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> 1. Python Profiler: 
> https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/profiler.html,
>  
> {code}
> >>> sc._conf.set("spark.python.profile", "true")
> >>> rdd = sc.parallelize(range(100)).map(str)
> >>> rdd.count()
> 100
> >>> sc.show_profiles()
> {code}
> 2. Python Debugger, e.g. 
> https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter
>  \(?\)
> 3. Monitoring Python Workers: {{top}}, {{ls}}, etc. \(?\)
> 4. PyCharm setup \(?\) (at https://issues.apache.org/jira/browse/SPARK-32189)






[jira] [Resolved] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32903.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29776
[https://github.com/apache/spark/pull/29776]

> GeneratePredicate should be able to eliminate common sub-expressions
> 
>
> Key: SPARK-32903
> URL: https://issues.apache.org/jira/browse/SPARK-32903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Codegen objects such as {{GenerateMutableProjection}} and 
> {{GenerateUnsafeProjection}} can eliminate common sub-expressions, but 
> {{GeneratePredicate}} currently doesn't.
> We encountered a customer issue where a Filter pushed down through a Project 
> caused a performance regression compared with the non-pushed-down case. The 
> problem is that one expression used in the Filter predicate is evaluated many 
> times. Due to the complex schema, the query nodes don't use whole-stage 
> codegen, so Spark runs {{Filter.doExecute}}, which calls {{GeneratePredicate}}. 
> The common expression was evaluated many times and became the performance 
> bottleneck. {{GeneratePredicate}} should be able to eliminate common 
> sub-expressions for such cases.
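To make the pattern concrete, a hedged illustration (the customer query and schema are not shown in the ticket, so the names below are hypothetical and this is not the internal fix itself): the same expensive expression referenced twice in a filter predicate is evaluated twice per row when {{GeneratePredicate}} cannot eliminate common sub-expressions.

{code}
import org.apache.spark.sql.functions.{col, udf}

// Stand-in for a costly expression that appears more than once in the predicate.
val expensive = udf((s: String) => { Thread.sleep(1); s.length })

val df = spark.range(100000).selectExpr("CAST(id AS STRING) AS s")

// Without common-subexpression elimination, expensive(s) is evaluated twice per row here.
df.filter(expensive(col("s")) > 3 && expensive(col("s")) < 100).count()
{code}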






[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-16 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197377#comment-17197377
 ] 

Gengliang Wang commented on SPARK-27589:


[~dongjoon] Thanks for the reminder. I will revisit this part soon and create a 
proper JIRA ticket for it.

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API






[jira] [Commented] (SPARK-32906) Struct field names should not change after normalizing floats

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197374#comment-17197374
 ] 

Apache Spark commented on SPARK-32906:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29780

> Struct field names should not change after normalizing floats
> -
>
> Key: SPARK-32906
> URL: https://issues.apache.org/jira/browse/SPARK-32906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims at fixing a minor bug when normalizing floats for struct 
> types;
> {code}
> scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
> scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
> scala> val agg = df.distinct()
> scala> agg.explain()
> == Physical Plan ==
> *(2) HashAggregate(keys=[k#40], functions=[])
> +- Exchange hashpartitioning(k#40, 200), true, [id=#62]
>+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) 
> null else named_struct(col1, 
> knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], 
> functions=[])
>   +- *(1) LocalTableScan [k#40]
> scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: 
> HashAggregateExec => a.output.head }
> scala> aggOutput.foreach { attr => println(attr.prettyJson) }
> ### Final Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "_1",
> ^^^
>   "type" : "double",
>   "nullable" : false,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> ### Partial Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "col1",
> 
>   "type" : "double",
>   "nullable" : true,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> {code}






[jira] [Assigned] (SPARK-32906) Struct field names should not change after normalizing floats

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32906:


Assignee: Apache Spark

> Struct field names should not change after normalizing floats
> -
>
> Key: SPARK-32906
> URL: https://issues.apache.org/jira/browse/SPARK-32906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This ticket aims at fixing a minor bug when normalizing floats for struct 
> types;
> {code}
> scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
> scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
> scala> val agg = df.distinct()
> scala> agg.explain()
> == Physical Plan ==
> *(2) HashAggregate(keys=[k#40], functions=[])
> +- Exchange hashpartitioning(k#40, 200), true, [id=#62]
>+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) 
> null else named_struct(col1, 
> knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], 
> functions=[])
>   +- *(1) LocalTableScan [k#40]
> scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: 
> HashAggregateExec => a.output.head }
> scala> aggOutput.foreach { attr => println(attr.prettyJson) }
> ### Final Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "_1",
> ^^^
>   "type" : "double",
>   "nullable" : false,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> ### Partial Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "col1",
> 
>   "type" : "double",
>   "nullable" : true,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> {code}






[jira] [Commented] (SPARK-32906) Struct field names should not change after normalizing floats

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197373#comment-17197373
 ] 

Apache Spark commented on SPARK-32906:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29780

> Struct field names should not change after normalizing floats
> -
>
> Key: SPARK-32906
> URL: https://issues.apache.org/jira/browse/SPARK-32906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims at fixing a minor bug when normalizing floats for struct 
> types;
> {code}
> scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
> scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
> scala> val agg = df.distinct()
> scala> agg.explain()
> == Physical Plan ==
> *(2) HashAggregate(keys=[k#40], functions=[])
> +- Exchange hashpartitioning(k#40, 200), true, [id=#62]
>+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) 
> null else named_struct(col1, 
> knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], 
> functions=[])
>   +- *(1) LocalTableScan [k#40]
> scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: 
> HashAggregateExec => a.output.head }
> scala> aggOutput.foreach { attr => println(attr.prettyJson) }
> ### Final Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "_1",
> ^^^
>   "type" : "double",
>   "nullable" : false,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> ### Partial Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "col1",
> 
>   "type" : "double",
>   "nullable" : true,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> {code}






[jira] [Assigned] (SPARK-32906) Struct field names should not change after normalizing floats

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32906:


Assignee: (was: Apache Spark)

> Struct field names should not change after normalizing floats
> -
>
> Key: SPARK-32906
> URL: https://issues.apache.org/jira/browse/SPARK-32906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims at fixing a minor bug when normalizing floats for struct 
> types;
> {code}
> scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
> scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
> scala> val agg = df.distinct()
> scala> agg.explain()
> == Physical Plan ==
> *(2) HashAggregate(keys=[k#40], functions=[])
> +- Exchange hashpartitioning(k#40, 200), true, [id=#62]
>+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) 
> null else named_struct(col1, 
> knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], 
> functions=[])
>   +- *(1) LocalTableScan [k#40]
> scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: 
> HashAggregateExec => a.output.head }
> scala> aggOutput.foreach { attr => println(attr.prettyJson) }
> ### Final Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "_1",
> ^^^
>   "type" : "double",
>   "nullable" : false,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> ### Partial Aggregate ###
> [ {
>   "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children" : 0,
>   "name" : "k",
>   "dataType" : {
> "type" : "struct",
> "fields" : [ {
>   "name" : "col1",
> 
>   "type" : "double",
>   "nullable" : true,
>   "metadata" : { }
> } ]
>   },
>   "nullable" : true,
>   "metadata" : { },
>   "exprId" : {
> "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
> "id" : 40,
> "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
>   },
>   "qualifier" : [ ]
> } ]
> {code}






[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2

2020-09-16 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197370#comment-17197370
 ] 

Gengliang Wang commented on SPARK-28396:


[~dongjoon] It's very likely that this will be done, but there is no detailed 
solution yet. Sorry, I should have marked it as "Later".
Once there is a conclusion, or someone in the community comes up with a good 
proposal, let's reopen this.
cc [~maxgekk]

> Add PathCatalog for data source V2
> --
>
> Key: SPARK-28396
> URL: https://issues.apache.org/jira/browse/SPARK-28396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add PathCatalog for data source V2, so that:
> 1. We can convert SaveMode in DataFrameWriter into catalog table operations, 
> instead of supporting SaveMode in file source V2.
> 2. Support create-table SQL statements like "CREATE TABLE orc.'path'"






[jira] [Created] (SPARK-32906) Struct field names should not change after normalizing floats

2020-09-16 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-32906:


 Summary: Struct field names should not change after normalizing 
floats
 Key: SPARK-32906
 URL: https://issues.apache.org/jira/browse/SPARK-32906
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Takeshi Yamamuro


This ticket aims at fixing a minor bug when normalizing floats for struct types;
{code}
scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
scala> val agg = df.distinct()
scala> agg.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[k#40], functions=[])
+- Exchange hashpartitioning(k#40, 200), true, [id=#62]
   +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) 
null else named_struct(col1, 
knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], 
functions=[])
  +- *(1) LocalTableScan [k#40]

scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: 
HashAggregateExec => a.output.head }
scala> aggOutput.foreach { attr => println(attr.prettyJson) }
### Final Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
"type" : "struct",
"fields" : [ {
  "name" : "_1",
^^^
  "type" : "double",
  "nullable" : false,
  "metadata" : { }
} ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 40,
"jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]

### Partial Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
"type" : "struct",
"fields" : [ {
  "name" : "col1",

  "type" : "double",
  "nullable" : true,
  "metadata" : { }
} ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 40,
"jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]
{code}






[jira] [Commented] (SPARK-32180) Getting Started - Installation

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197358#comment-17197358
 ] 

Apache Spark commented on SPARK-32180:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29779

> Getting Started - Installation
> --
>
> Key: SPARK-32180
> URL: https://issues.apache.org/jira/browse/SPARK-32180
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Rohit Mishra
>Priority: Minor
> Fix For: 3.1.0
>
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/install.html
> https://pandas.pydata.org/docs/getting_started/install.html






[jira] [Commented] (SPARK-32180) Getting Started - Installation

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197357#comment-17197357
 ] 

Apache Spark commented on SPARK-32180:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29779

> Getting Started - Installation
> --
>
> Key: SPARK-32180
> URL: https://issues.apache.org/jira/browse/SPARK-32180
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Rohit Mishra
>Priority: Minor
> Fix For: 3.1.0
>
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/install.html
> https://pandas.pydata.org/docs/getting_started/install.html






[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-16 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197341#comment-17197341
 ] 

wuyi commented on SPARK-32898:
--

I think the issue (for executorRunTimeMs) is: before a task reaches 
"taskStartTimeNs = System.nanoTime()", it might already have been killed (e.g., by 
another successful attempt). So taskStartTimeNs never gets initialized and 
remains 0. However, executorRunTimeMs is calculated as "System.nanoTime() - 
taskStartTimeNs" in collectAccumulatorsAndResetStatusOnFailure, which obviously 
yields a wrongly large result when taskStartTimeNs = 0.

I haven't taken a detailed look at the submissionTime, but it sounds like a 
different issue? Though it may be due to the same logic hole.

I'd like to make a fix for executorRunTimeMs first if [~linhongliu-db] 
doesn't mind.
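A minimal, self-contained sketch of the kind of guard described above (an assumption about the shape of the fix, not the actual patch):

{code}
// If the task was killed before taskStartTimeNs was assigned, it stays at its default of 0,
// and "System.nanoTime() - 0" converted to ms is the huge value reported in this ticket.
def executorRunTimeMs(taskStartTimeNs: Long): Long =
  if (taskStartTimeNs > 0) (System.nanoTime() - taskStartTimeNs) / 1000000L else 0L

println(executorRunTimeMs(0L))                  // guarded: reports 0 instead of a bogus value
println(executorRunTimeMs(System.nanoTime()))   // normal case: a small elapsed time
{code}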

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Linhong Liu
>Priority: Major
>
> This might be caused by incorrectly calculating executorRunTimeMs in 
> Executor.scala.
>  The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs is not set yet (it is 0).
> As of now on the master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> An exception is thrown before this line, and the catch branch still updates 
> the metric.
>  However, the query shows as SUCCESSful. Maybe this task is speculative; not 
> sure.
>  
> submissionTime in LiveExecutionData may also have a similar problem.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  






[jira] [Commented] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197338#comment-17197338
 ] 

Apache Spark commented on SPARK-18409:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29778

> LSH approxNearestNeighbors should use approxQuantile instead of sort
> 
>
> Key: SPARK-18409
> URL: https://issues.apache.org/jira/browse/SPARK-18409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: bulk-closed
>
> LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in 
> order to find a threshold.  It should use approxQuantile instead.
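A hedged sketch of the suggested direction (a toy DataFrame stands in for the model's dataset with distances; column and parameter names are illustrative, not the ML internals):

{code}
import org.apache.spark.sql.functions.col

// Toy stand-in for the dataset with a hashDistance column.
val withDist = spark.range(0, 100000).selectExpr("id", "rand(42) AS hashDist")

val numNearestNeighbors = 10
val quantile = math.min(1.0, numNearestNeighbors.toDouble / withDist.count())

// Estimate the k-th smallest distance with approxQuantile instead of sorting everything.
val Array(threshold) = withDist.stat.approxQuantile("hashDist", Array(quantile), 0.01)
val candidates = withDist.filter(col("hashDist") <= threshold)
println(s"threshold = $threshold, candidates = ${candidates.count()}")
{code}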






[jira] [Assigned] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32905:


Assignee: (was: Apache Spark)

> ApplicationMaster fails to receive UpdateDelegationTokens message
> -
>
> Key: SPARK-32905
> URL: https://issues.apache.org/jira/browse/SPARK-32905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, 
> launching executors on 0 of them.
> 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> With a long-running application in kerberized mode, the AMEndpoint handles 
> the token-updating message incorrectly.






[jira] [Commented] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197333#comment-17197333
 ] 

Apache Spark commented on SPARK-32905:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29777

> ApplicationMaster fails to receive UpdateDelegationTokens message
> -
>
> Key: SPARK-32905
> URL: https://issues.apache.org/jira/browse/SPARK-32905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, 
> launching executors on 0 of them.
> 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> With a long-running application in kerberized mode, the AMEndpoint handles 
> the token-updating message incorrectly.






[jira] [Assigned] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32905:


Assignee: Apache Spark

> ApplicationMaster fails to receive UpdateDelegationTokens message
> -
>
> Key: SPARK-32905
> URL: https://issues.apache.org/jira/browse/SPARK-32905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, 
> launching executors on 0 of them.
> 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
> org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
> does not implement 'receive'
>   at 
> org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> With a long-running application in kerberized mode, the AMEndpoint handles 
> the token-updating message incorrectly.






[jira] [Created] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message

2020-09-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-32905:


 Summary: ApplicationMaster fails to receive UpdateDelegationTokens 
message
 Key: SPARK-32905
 URL: https://issues.apache.org/jira/browse/SPARK-32905
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 3.0.1, 3.1.0
Reporter: Kent Yao



{code:java}
20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, 
launching executors on 0 of them.
20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
does not implement 'receive'
at 
org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
does not implement 'receive'
at 
org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}


With a long-running application in kerberized mode, the AMEndpoint handles the 
token-updating message incorrectly.
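The stack trace above points at RpcEndpoint's default {{receive}}, which simply throws. A self-contained illustration of that failure mode and the general fix direction (simplified stand-ins, not Spark's actual org.apache.spark.rpc classes or the actual AM code):

{code}
// An endpoint whose `receive` does not cover a one-way message falls through to a
// default case that throws, which is the "does not implement 'receive'" error above.
case class UpdateDelegationTokens(tokens: Array[Byte])

trait Endpoint {
  def receive: PartialFunction[Any, Unit] = {
    case msg => throw new RuntimeException(s"$this does not implement 'receive' for $msg")
  }
}

object AmEndpoint extends Endpoint {
  // Fix direction described in the ticket: actually handle the token-update message.
  override def receive: PartialFunction[Any, Unit] = {
    case UpdateDelegationTokens(tokens) => println(s"updating ${tokens.length} bytes of tokens")
  }
}

AmEndpoint.receive(UpdateDelegationTokens(Array[Byte](1, 2, 3)))
{code}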






[jira] [Assigned] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption

2020-09-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-26425:


Assignee: Jungtaek Lim  (was: Tathagata Das)

> Add more constraint checks in file streaming source to avoid checkpoint 
> corruption
> --
>
> Key: SPARK-26425
> URL: https://issues.apache.org/jira/browse/SPARK-26425
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Tathagata Das
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> Two issues observed in production:
> - HDFSMetadataLog.getLatest() tries to read older versions when it is not 
> able to read the latest listed version file. It is unclear why this was done, 
> but it should not be: if the latest listed file is not readable, then 
> something is horribly wrong and we should fail rather than report an older 
> version, as that can completely corrupt the checkpoint directory. 
> - FileStreamSource should check whether adding a new batch to the 
> FileStreamSourceLog succeeded or not, similar to how StreamExecution checks 
> the OffsetSeqLog (a minimal sketch of this check follows).
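A minimal sketch of the second check (an assumption about its shape, not the actual Spark code): treat a failed metadata-log write as fatal instead of silently continuing, mirroring how StreamExecution validates its OffsetSeqLog write.

{code}
// `add` stands in for a metadata-log write such as FileStreamSourceLog.add(batchId, entries),
// which returns false when another writer already committed that batch.
def commitBatch(add: (Long, Seq[String]) => Boolean, batchId: Long, files: Seq[String]): Unit = {
  if (!add(batchId, files)) {
    throw new IllegalStateException(
      s"Concurrent update to the log. Multiple streaming jobs detected for batch $batchId")
  }
}
{code}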






[jira] [Resolved] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption

2020-09-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-26425.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 25965
[https://github.com/apache/spark/pull/25965]

> Add more constraint checks in file streaming source to avoid checkpoint 
> corruption
> --
>
> Key: SPARK-26425
> URL: https://issues.apache.org/jira/browse/SPARK-26425
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.1.0
>
>
> Two issues observed in production:
> - HDFSMetadataLog.getLatest() tries to read older versions when it is not 
> able to read the latest listed version file. It is unclear why this was done, 
> but it should not be: if the latest listed file is not readable, then 
> something is horribly wrong and we should fail rather than report an older 
> version, as that can completely corrupt the checkpoint directory. 
> - FileStreamSource should check whether adding a new batch to the 
> FileStreamSourceLog succeeded or not, similar to how StreamExecution checks 
> the OffsetSeqLog.






[jira] [Created] (SPARK-32904) pyspark.mllib.evaluation.MulticlassMetrics needs to swap the results of precision( ) and recall( )

2020-09-16 Thread TinaLi (Jira)
TinaLi created SPARK-32904:
--

 Summary: pyspark.mllib.evaluation.MulticlassMetrics needs to swap 
the results of precision( ) and recall( )
 Key: SPARK-32904
 URL: https://issues.apache.org/jira/browse/SPARK-32904
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 3.0.1
Reporter: TinaLi


[https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/MulticlassMetrics.html]

*The values returned by the precision() and recall() methods of this API should 
be swapped.*

Following are the example results I got when I ran this API; it prints the 
confusion matrix, precision, and recall:

metrics = MulticlassMetrics(predictionAndLabels)
print(metrics.confusionMatrix().toArray())
print("precision: ", metrics.precision(1))
print("recall: ", metrics.recall(1))

[[36631. 2845.]
 [ 3839. 1610.]]

precision: 0.3613916947250281

recall: 0.2954670581758121

predictions.select('prediction').agg({'prediction':'sum'}).show()
|sum(prediction)| 5449.0|

As you can see, my model predicted 5449 cases with label=1, and 1610 of those 
5449 cases are true positives, so precision should be 
1610/5449 = 0.2954670581758121, but this API assigns that value to the 
recall() method; the two should be swapped.






[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2020-09-16 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197250#comment-17197250
 ] 

Terry Kim commented on SPARK-29900:
---

When you run `show tables`, you get the `isTemporary` column, so I don't think 
there is any confusion. For `SQLContext.tableNames`, you can pass a database name 
(e.g., "default"), which will filter out any temporary views.

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. Try to look up the temp view first, then the table/persistent view.
> 2. Try to look up only the table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when a temp view and a table have 
> the same name, but users can easily use a qualified table name to 
> disambiguate.
> In Postgres, the relation resolution behavior is consistent:
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, the temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}






[jira] [Commented] (SPARK-30283) V2 Command logical plan should use UnresolvedV2Relation for a table

2020-09-16 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197249#comment-17197249
 ] 

Terry Kim commented on SPARK-30283:
---

[~cloud_fan], this work stopped when alter table hit a bump in the road, but 
should we resume the work for commands like "drop table", etc., whose v2 
behavior is different from v1? WDYT?

> V2 Command logical plan should use UnresolvedV2Relation for a table
> ---
>
> Key: SPARK-30283
> URL: https://issues.apache.org/jira/browse/SPARK-30283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> For the following v2 commands, multi-part names are directly passed to the 
> command without looking up temp views, thus they are always resolved to 
> tables:
>  * DROP TABLE
>  * REFRESH TABLE
>  * RENAME TABLE
>  * REPLACE TABLE
> They should be updated to have UnresolvedV2Relation such that temp views are 
> looked up first in Analyzer.ResolveTables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197228#comment-17197228
 ] 

Dongjoon Hyun commented on SPARK-27589:
---

[~Gengliang.Wang], since SPARK-28396 is closed as 'Won't Fix', what is the new 
JIRA issue for tracking `1. Make the File source V2 writer working`?

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2

2020-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197225#comment-17197225
 ] 

Dongjoon Hyun commented on SPARK-28396:
---

Hi, [~Gengliang.Wang]. This is closed as 'Won't Fix'. So there is no future 
plan for this, even in Apache Spark 3.1.0?

Also, cc [~cloud_fan] and [~smilegator]

> Add PathCatalog for data source V2
> --
>
> Key: SPARK-28396
> URL: https://issues.apache.org/jira/browse/SPARK-28396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add PathCatalog for data source V2, so that:
> 1. We can convert SaveMode in DataFrameWriter into catalog table operations, 
> instead of supporting SaveMode in file source V2.
> 2. Support create-table SQL statements like "CREATE TABLE orc.'path'"
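
For context, Spark already supports path-qualified reads of this form; the proposal above would extend the idea to table creation. A small sketch of the existing read syntax (hypothetical path):

{code:scala}
// Query ORC files directly by qualifying the format with a path.
spark.range(3).write.mode("overwrite").orc("/tmp/spark/orc_demo")
spark.sql("SELECT * FROM orc.`/tmp/spark/orc_demo`").show()
{code}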



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197223#comment-17197223
 ] 

Apache Spark commented on SPARK-32903:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29776

> GeneratePredicate should be able to eliminate common sub-expressions
> 
>
> Key: SPARK-32903
> URL: https://issues.apache.org/jira/browse/SPARK-32903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Codegen objects such as {{GenerateMutableProjection}} and 
> {{GenerateUnsafeProjection}} can eliminate common sub-expressions, but 
> {{GeneratePredicate}} currently doesn't.
> We encountered a customer issue where a Filter pushed down through a Project 
> caused a performance problem compared with the non-pushed-down case: one 
> expression used in the Filter predicates was evaluated many times. Because of 
> the complex schema, the query nodes are not whole-stage codegen, so the plan 
> runs {{Filter.doExecute}} and then calls {{GeneratePredicate}}. The common 
> expression was evaluated many times and became the performance bottleneck. 
> {{GeneratePredicate}} should be able to eliminate common sub-expressions in 
> such cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32903:


Assignee: Apache Spark  (was: L. C. Hsieh)

> GeneratePredicate should be able to eliminate common sub-expressions
> 
>
> Key: SPARK-32903
> URL: https://issues.apache.org/jira/browse/SPARK-32903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Codegen objects such as {{GenerateMutableProjection}} and 
> {{GenerateUnsafeProjection}} can eliminate common sub-expressions, but 
> {{GeneratePredicate}} currently doesn't.
> We encountered a customer issue where a Filter pushed down through a Project 
> caused a performance problem compared with the non-pushed-down case: one 
> expression used in the Filter predicates was evaluated many times. Because of 
> the complex schema, the query nodes are not whole-stage codegen, so the plan 
> runs {{Filter.doExecute}} and then calls {{GeneratePredicate}}. The common 
> expression was evaluated many times and became the performance bottleneck. 
> {{GeneratePredicate}} should be able to eliminate common sub-expressions in 
> such cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32903:


Assignee: L. C. Hsieh  (was: Apache Spark)

> GeneratePredicate should be able to eliminate common sub-expressions
> 
>
> Key: SPARK-32903
> URL: https://issues.apache.org/jira/browse/SPARK-32903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Codegen objects such as {{GenerateMutableProjection}} and 
> {{GenerateUnsafeProjection}} can eliminate common sub-expressions, but 
> {{GeneratePredicate}} currently doesn't.
> We encountered a customer issue where a Filter pushed down through a Project 
> caused a performance problem compared with the non-pushed-down case: one 
> expression used in the Filter predicates was evaluated many times. Because of 
> the complex schema, the query nodes are not whole-stage codegen, so the plan 
> runs {{Filter.doExecute}} and then calls {{GeneratePredicate}}. The common 
> expression was evaluated many times and became the performance bottleneck. 
> {{GeneratePredicate}} should be able to eliminate common sub-expressions in 
> such cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-32903:
---

 Summary: GeneratePredicate should be able to eliminate common 
sub-expressions
 Key: SPARK-32903
 URL: https://issues.apache.org/jira/browse/SPARK-32903
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Codegen objects such as {{GenerateMutableProjection}} and 
{{GenerateUnsafeProjection}} can eliminate common sub-expressions, but 
{{GeneratePredicate}} currently doesn't.

We encountered a customer issue where a Filter pushed down through a Project 
caused a performance problem compared with the non-pushed-down case: one 
expression used in the Filter predicates was evaluated many times. Because of 
the complex schema, the query nodes are not whole-stage codegen, so the plan 
runs {{Filter.doExecute}} and then calls {{GeneratePredicate}}. The common 
expression was evaluated many times and became the performance bottleneck. 
{{GeneratePredicate}} should be able to eliminate common sub-expressions in 
such cases.
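
A rough sketch of the query shape described above (hypothetical data and an artificially expensive function standing in for the real expression; the actual case involved a complex schema):

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

// Deterministic but deliberately slow function standing in for the common expression.
val expensive = udf { s: String => { Thread.sleep(1); s.length } }

val projected = (1 to 1000).map(i => s"row_$i").toDF("value")
  .select($"value", expensive($"value").as("len"))

// Once the Filter is pushed below the Project, both conjuncts reference
// expensive(value), so without common sub-expression elimination
// GeneratePredicate evaluates it twice per row.
val filtered = projected.filter($"len" > 3 && $"len" < 100)
filtered.explain(true)
filtered.count()
{code}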



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32295:


Assignee: (was: Apache Spark)

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: performance
>
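
The title describes a rewrite along the following lines (a manual sketch with hypothetical data; the ticket proposes having the optimizer add these filters automatically):

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, Seq(1, 2)), (2, Seq.empty[Int]), (3, null)).toDF("id", "xs")

// An inner explode already drops rows whose array is null or empty, so adding
// these filters does not change the result, but it produces predicates that
// can be pushed down towards the data source.
val rewritten = df
  .filter($"xs".isNotNull && size($"xs") > 0)
  .select($"id", explode($"xs").as("x"))
rewritten.show()
{code}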




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32295:


Assignee: Apache Spark

> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>  Labels: performance
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-09-16 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis reopened SPARK-32295:


> Add not null and size > 0 filters before inner explode to benefit from 
> predicate pushdown
> -
>
> Key: SPARK-32295
> URL: https://issues.apache.org/jira/browse/SPARK-32295
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: performance
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24994) Add UnwrapCastInBinaryComparison optimizer to simplify literal types

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197198#comment-17197198
 ] 

Apache Spark commented on SPARK-24994:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29775

> Add UnwrapCastInBinaryComparison optimizer to simplify literal types
> 
>
> Key: SPARK-24994
> URL: https://issues.apache.org/jira/browse/SPARK-24994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liuxian
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> For the statement {{select * from table1 where a = 100}}, the data type of 
> `a` is `smallint`. Because the default data type of the literal 100 is `int`, 
> `a` is cast to `int` for the comparison.
> In this case, the filter cannot be pushed down to parquet.
> In our business SQL we generally do not rewrite 100 as a `smallint` literal, 
> so we hope Spark can still push the filter down to parquet in this situation.
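
A minimal sketch of the situation with a hypothetical table (the exact rewritten predicate is up to the new rule):

{code:scala}
spark.sql("CREATE TABLE table1 (a SMALLINT, b STRING) USING parquet")

// Without the optimization the literal 100 is an int, so the filter becomes
// cast(a as int) = 100 and cannot be pushed down to Parquet.
spark.sql("SELECT * FROM table1 WHERE a = 100").explain(true)

// Keeping the comparison in smallint, e.g. a = cast(100 as smallint),
// leaves a plain column comparison that can be pushed down.
spark.sql("SELECT * FROM table1 WHERE a = CAST(100 AS SMALLINT)").explain(true)
{code}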



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32890) Pass all `sql/hive` module UTs in Scala 2.13

2020-09-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-32890:


Assignee: Yang Jie

> Pass all `sql/hive` module UTs in Scala 2.13
> 
>
> Key: SPARK-32890
> URL: https://issues.apache.org/jira/browse/SPARK-32890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> There are only 4 failed test cases in the sql/hive module with these commands:
>  
> {code:java}
> mvn clean install -DskipTests -pl sql/hive -am -Pscala-2.13 -Phive
> mvn clean test -pl sql/hive -Pscala-2.13 -Phive{code}
>  
> The failed cases are as follows:
>  * HiveSchemaInferenceSuite (1 FAILED)
>  * HiveSparkSubmitSuite (1 FAILED)
>  * StatisticsSuite (1 FAILED)
>  * HiveDDLSuite (1 FAILED)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32890) Pass all `sql/hive` module UTs in Scala 2.13

2020-09-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32890.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29760
[https://github.com/apache/spark/pull/29760]

> Pass all `sql/hive` module UTs in Scala 2.13
> 
>
> Key: SPARK-32890
> URL: https://issues.apache.org/jira/browse/SPARK-32890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.1.0
>
>
> There are only 4 failed test cases in the sql/hive module with these commands:
>  
> {code:java}
> mvn clean install -DskipTests -pl sql/hive -am -Pscala-2.13 -Phive
> mvn clean test -pl sql/hive -Pscala-2.13 -Phive{code}
>  
> The failed cases are as follows:
>  * HiveSchemaInferenceSuite (1 FAILED)
>  * HiveSparkSubmitSuite (1 FAILED)
>  * StatisticsSuite (1 FAILED)
>  * HiveDDLSuite (1 FAILED)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-09-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197143#comment-17197143
 ] 

Thomas Graves commented on SPARK-27589:
---

Somewhat related: I was looking through the v2 code for parquet and I don't see 
anything for bucketing. Is bucketing supported with the V2 API?

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32897) SparkSession.builder.getOrCreate should not show deprecation warning of SQLContext

2020-09-16 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-32897:
--
Affects Version/s: (was: 2.4.7)

> SparkSession.builder.getOrCreate should not show deprecation warning of 
> SQLContext
> --
>
> Key: SPARK-32897
> URL: https://issues.apache.org/jira/browse/SPARK-32897
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> In PySpark shell:
> {code}
> import warnings
> from pyspark.sql import SparkSession, SQLContext
> warnings.simplefilter('always', DeprecationWarning)
> spark.stop()
> SparkSession.builder.getOrCreate()
> {code}
> shows a deprecation warning from {{SQLContext}}
> {code}
> /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated 
> in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
>   DeprecationWarning)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32897) SparkSession.builder.getOrCreate should not show deprecation warning of SQLContext

2020-09-16 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-32897.
---
Fix Version/s: 3.1.0
   3.0.2
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Issue resolved by pull request 29768
https://github.com/apache/spark/pull/29768

> SparkSession.builder.getOrCreate should not show deprecation warning of 
> SQLContext
> --
>
> Key: SPARK-32897
> URL: https://issues.apache.org/jira/browse/SPARK-32897
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> In PySpark shell:
> {code}
> import warnings
> from pyspark.sql import SparkSession, SQLContext
> warnings.simplefilter('always', DeprecationWarning)
> spark.stop()
> SparkSession.builder.getOrCreate()
> {code}
> shows a deprecation warning from {{SQLContext}}
> {code}
> /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated 
> in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
>   DeprecationWarning)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32816) Planner error when aggregating multiple distinct DECIMAL columns

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32816:
---

Assignee: Linhong Liu

> Planner error when aggregating multiple distinct DECIMAL columns
> 
>
> Key: SPARK-32816
> URL: https://issues.apache.org/jira/browse/SPARK-32816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.1.0
>
>
> Running different DISTINCT decimal aggregations causes a query planner error:
> {code:java}
> java.lang.RuntimeException: You hit a query analyzer bug. Please report your 
> query to Spark user mailing list.
> at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:473)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:67)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:97)
>   at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:82)
>   at 
> scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
>   at 
> scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
> {code}
> example failing query
> {code:java}
> import org.apache.spark.util.Utils
> // Changing decimal(9, 0) to decimal(8, 0) fixes the problem. Root cause 
> seems to have to do with
> // UnscaledValue being used in one of the expressions but not the other.
> val df = spark.range(0, 5, 1, 1).selectExpr(
>   "id",
>   "cast(id as decimal(9, 0)) as ss_ext_list_price")
> val cacheDir = Utils.createTempDir().getCanonicalPath
> df.write.parquet(cacheDir)
> spark.read.parquet(cacheDir).createOrReplaceTempView("test_table")
> spark.sql("""
> select
> avg(distinct ss_ext_list_price), sum(distinct ss_ext_list_price)
> from test_table""").explain
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32816) Planner error when aggregating multiple distinct DECIMAL columns

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32816.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29673
[https://github.com/apache/spark/pull/29673]

> Planner error when aggregating multiple distinct DECIMAL columns
> 
>
> Key: SPARK-32816
> URL: https://issues.apache.org/jira/browse/SPARK-32816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Priority: Major
> Fix For: 3.1.0
>
>
> Running different DISTINCT decimal aggregations causes a query planner error:
> {code:java}
> java.lang.RuntimeException: You hit a query analyzer bug. Please report your 
> query to Spark user mailing list.
> at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:473)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:67)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:97)
>   at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:82)
>   at 
> scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
>   at 
> scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
> {code}
> example failing query
> {code:java}
> import org.apache.spark.util.Utils
> // Changing decimal(9, 0) to decimal(8, 0) fixes the problem. Root cause 
> seems to have to do with
> // UnscaledValue being used in one of the expressions but not the other.
> val df = spark.range(0, 5, 1, 1).selectExpr(
>   "id",
>   "cast(id as decimal(9, 0)) as ss_ext_list_price")
> val cacheDir = Utils.createTempDir().getCanonicalPath
> df.write.parquet(cacheDir)
> spark.read.parquet(cacheDir).createOrReplaceTempView("test_table")
> spark.sql("""
> select
> avg(distinct ss_ext_list_price), sum(distinct ss_ext_list_price)
> from test_table""").explain
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah closed SPARK-32888.
--

Resolved by adding documentation

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x)
>  * create a df like so: mydf = spark.read.csv(prdd, header=True)
>  * df.count() will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197073#comment-17197073
 ] 

L. C. Hsieh commented on SPARK-32888:
-

Yes, there is a difference, but it comes from reading a file versus reading from 
an RDD/Dataset. When reading a file, we know exactly which line is the first 
line, so we can remove it. When we read from an RDD/Dataset, we don't know which 
lines were the first lines of the files being read, so the best we can do is 
remove every line that matches the first line (the header) in the RDD/Dataset.
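
A small sketch of that difference (assuming a local file {{test.csv}} whose two lines are both {{aaa,bbb}}, as in the description):

{code:scala}
// Reading the file directly: only the literal first line is treated as the header.
val fromFile = spark.read.option("header", "true").csv("test.csv")
fromFile.count()   // 1

// Reading from a Dataset[String] of CSV lines: every line equal to the header is dropped.
val lines = spark.read.textFile("test.csv")
val fromLines = spark.read.option("header", "true").csv(lines)
fromLines.count()  // 0
{code}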

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x)
>  * create a df like so: mydf = spark.read.csv(prdd, header=True)
>  * df.count() will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197069#comment-17197069
 ] 

Punit Shah commented on SPARK-32888:


Thank you for your reply [~viirya]. However, what I've noticed is that NO lines 
are removed during a straight csv import; that is why I raised the comment. 
There is a difference in the results when reading from csv directly and when 
reading from an rdd.

Reading from rdds causes lines to be removed, while a straight csv import 
doesn't.

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x)
>  * create a df like so: mydf = spark.read.csv(prdd, header=True)
>  * df.count() will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-16 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-32898:
-
Description: 
This might be because of incorrectly calculating executorRunTimeMs in 
Executor.scala
 The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
be called when taskStartTimeNs is not set yet (it is 0).

As of now in master branch, here is the problematic code: 

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]

 

An exception is thrown before this line, and the catch branch still updates 
the metric. However, the query shows as SUCCESSful. Maybe this task is 
speculative; not sure.

submissionTime in LiveExecutionData may also have a similar problem.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]

 

  was:
This might be because of incorrectly calculating executorRunTimeMs in 
Executor.scala
The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can be 
called when taskStartTimeNs is not set yet (it is 0).

As of now in master branch, here is the problematic code: 

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]

 

There is a throw exception before this line. The catch branch still updates the 
metric.
However the query shows as SUCCESSful in QPL. Maybe this task is speculative. 
Not sure.

 

submissionTime in LiveExecutionData may also have similar problem.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]

 


> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Linhong Liu
>Priority: Major
>
> This might be because of incorrectly calculating executorRunTimeMs in 
> Executor.scala
>  The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs is not set yet (it is 0).
> As of now in master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> An exception is thrown before this line, and the catch branch still updates 
> the metric. However, the query shows as SUCCESSful. Maybe this task is 
> speculative; not sure.
>  
> submissionTime in LiveExecutionData may also have a similar problem.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  
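
To illustrate why an unset start time inflates the metric, a standalone sketch with hypothetical variable names (not the actual Executor.scala code):

{code:scala}
import java.util.concurrent.TimeUnit

// taskStartTimeNs stays 0 until the task body actually starts running.
val taskStartTimeNs: Long = 0L
val taskFinishNs: Long = System.nanoTime()

// Subtracting 0 yields the full value of the monotonic clock, which is huge.
val naiveRunTimeMs = TimeUnit.NANOSECONDS.toMillis(taskFinishNs - taskStartTimeNs)

// The kind of guard the description suggests: only record run time if the task started.
val guardedRunTimeMs =
  if (taskStartTimeNs > 0) TimeUnit.NANOSECONDS.toMillis(taskFinishNs - taskStartTimeNs) else 0L
{code}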



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197046#comment-17197046
 ] 

L. C. Hsieh commented on SPARK-32888:
-

Reading csv files is simple: we can just remove the first line. But when we read 
an RDD[String] or Dataset[String] containing CSV lines, we don't know which 
lines are the first lines of the files. So what we can do is just remove the 
lines that are the same as the header.

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x)
>  * create a df like so: mydf = spark.read.csv(prdd, header=True)
>  * df.count() will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32850) Simply the RPC message flow of decommission

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32850:
---

Assignee: wuyi

> Simply the RPC message flow of decommission
> ---
>
> Key: SPARK-32850
> URL: https://issues.apache.org/jira/browse/SPARK-32850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> The RPC message flow of decommission is kind of messy because multiple use 
> cases are mixed together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32850) Simply the RPC message flow of decommission

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32850.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29722
[https://github.com/apache/spark/pull/29722]

> Simply the RPC message flow of decommission
> ---
>
> Key: SPARK-32850
> URL: https://issues.apache.org/jira/browse/SPARK-32850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> The RPC message flow of decommission is kind of messy because multiple use 
> cases are mixed together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196985#comment-17196985
 ] 

Punit Shah edited comment on SPARK-32888 at 9/16/20, 2:55 PM:
--

Why do we remove lines that are the same as the header? The result of this 
behaviour differs from reading csv files directly as opposed to from rdds.

[~viirya] would appreciate comment thanks...


was (Author: bullsoverbears):
Why do we remove lines that are the same as the header? The result of this 
behaviour differs from reading csv files directly as opposed to from rdds.

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x)
>  * create a df like so: mydf = spark.read.csv(prdd, header=True)
>  * df.count() will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32706) Poor performance when casting invalid decimal string to decimal type

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32706.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29731
[https://github.com/apache/spark/pull/29731]

> Poor performance when casting invalid decimal string to decimal type
> 
>
> Key: SPARK-32706
> URL: https://issues.apache.org/jira/browse/SPARK-32706
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: part-0.parquet
>
>
> How to reproduce this issue:
> {code:java}
> spark.read.parquet("/path/to/part-0.parquet").selectExpr("cast(bd as 
> decimal(18, 0)) as x").write.mode("overwrite").save("/tmp/spark/decimal")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32706) Poor performance when casting invalid decimal string to decimal type

2020-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32706:
---

Assignee: Yuming Wang

> Poor performance when casting invalid decimal string to decimal type
> 
>
> Key: SPARK-32706
> URL: https://issues.apache.org/jira/browse/SPARK-32706
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Attachments: part-0.parquet
>
>
> How to reproduce this issue:
> {code:java}
> spark.read.parquet("/path/to/part-0.parquet").selectExpr("cast(bd as 
> decimal(18, 0)) as x").write.mode("overwrite").save("/tmp/spark/decimal")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196985#comment-17196985
 ] 

Punit Shah commented on SPARK-32888:


Why do we remove lines that are the same as the header? The result of this 
behaviour differs from reading csv files directly as opposed to from rdds.

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x)
>  * create a df like so: mydf = spark.read.csv(prdd, header=True)
>  * df.count() will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32902) Logging plan changes for AQE

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32902:


Assignee: Apache Spark

> Logging plan changes for AQE
> 
>
> Key: SPARK-32902
> URL: https://issues.apache.org/jira/browse/SPARK-32902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> Recently, we added code to log plan changes in the preparation phase in 
> QueryExecution for execution (SPARK-32704). This ticket aims to apply the 
> same fix to log plan changes in AQE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32902) Logging plan changes for AQE

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196977#comment-17196977
 ] 

Apache Spark commented on SPARK-32902:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29774

> Logging plan changes for AQE
> 
>
> Key: SPARK-32902
> URL: https://issues.apache.org/jira/browse/SPARK-32902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> Recently, we added code to log plan changes in the preparation phase in 
> QueryExecution for execution (SPARK-32704). This ticket aims to apply the 
> same fix to log plan changes in AQE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32902) Logging plan changes for AQE

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32902:


Assignee: (was: Apache Spark)

> Logging plan changes for AQE
> 
>
> Key: SPARK-32902
> URL: https://issues.apache.org/jira/browse/SPARK-32902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> Recently, we added code to log plan changes in the preparation phase in 
> QueryExecution for execution (SPARK-32704). This ticket aims to apply the 
> same fix to log plan changes in AQE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32902) Logging plan changes for AQE

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196978#comment-17196978
 ] 

Apache Spark commented on SPARK-32902:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29774

> Logging plan changes for AQE
> 
>
> Key: SPARK-32902
> URL: https://issues.apache.org/jira/browse/SPARK-32902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> Recently, we added code to log plan changes in the preparation phase in 
> QueryExecution for execution (SPARK-32704). This ticket aims to apply the 
> same fix to log plan changes in AQE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196976#comment-17196976
 ] 

Apache Spark commented on SPARK-32287:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29773

> Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
> 
>
> Key: SPARK-32287
> URL: https://issues.apache.org/jira/browse/SPARK-32287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
>  This test becomes flaky in Github Actions, see: 
> https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509
> {code:java}
> [info] - add executors default profile *** FAILED *** (33 milliseconds)
> [info]   4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
> [info]   at 
> org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196975#comment-17196975
 ] 

Apache Spark commented on SPARK-32287:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29773

> Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
> 
>
> Key: SPARK-32287
> URL: https://issues.apache.org/jira/browse/SPARK-32287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
>  This test becomes flaky in Github Actions, see: 
> https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509
> {code:java}
> [info] - add executors default profile *** FAILED *** (33 milliseconds)
> [info]   4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
> [info]   at 
> org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32902) Logging plan changes for AQE

2020-09-16 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-32902:


 Summary: Logging plan changes for AQE
 Key: SPARK-32902
 URL: https://issues.apache.org/jira/browse/SPARK-32902
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


Recently, we added code to log plan changes in the preparation phase in 
QueryExecution for execution (SPARK-32704). This ticket aims to apply the same 
fix to log plan changes in AQE.
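
For reference, a sketch of how the existing plan-change logging is turned on (the config name here is my assumption based on the SPARK-32704-era change; adjust if it differs):

{code:scala}
import spark.implicits._

// Assumed runtime SQL config controlling plan-change logging; rule-by-rule plan
// changes are then written to the log at the chosen level. The ticket above
// would extend the same logging to AQE re-optimization.
spark.conf.set("spark.sql.planChangeLog.level", "warn")

spark.range(100).filter($"id" > 50).collect()
{code}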



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big

2020-09-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196972#comment-17196972
 ] 

Thomas Graves commented on SPARK-32898:
---

[~linhongliu-db] can you please provide more of a description? You say this was 
too big; did it cause an error for your job, or did you just notice the time was 
too big? Do you have a reproducible case?

You have some details there about what might be wrong with taskStartTimeNs 
possibly not being initialized; if you can give more details there in general, 
that would be great, as it's a bit hard to follow your description. If you have 
spent the time to debug this and have a fix in mind, please feel free to put up 
a pull request.

> totalExecutorRunTimeMs is too big
> -
>
> Key: SPARK-32898
> URL: https://issues.apache.org/jira/browse/SPARK-32898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Linhong Liu
>Priority: Major
>
> This might be because of incorrectly calculating executorRunTimeMs in 
> Executor.scala
> The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can 
> be called when taskStartTimeNs is not set yet (it is 0).
> As of now in master branch, here is the problematic code: 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470]
>  
> There is a throw exception before this line. The catch branch still updates 
> the metric.
> However the query shows as SUCCESSful in QPL. Maybe this task is speculative. 
> Not sure.
>  
> submissionTime in LiveExecutionData may also have similar problem.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32894) Timestamp cast in exernal orc table

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32894:
--
Summary: Timestamp cast in exernal orc table  (was: Timestamp cast in 
exernal ocr table)

> Timestamp cast in exernal orc table
> ---
>
> Key: SPARK-32894
> URL: https://issues.apache.org/jira/browse/SPARK-32894
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
> Java 1.8
> Hadoop 3.3.0
> Hive 3.1.2
> Python 3.7 (from pyspark)
>Reporter: Grigory Skvortsov
>Priority: Major
>
> I have an external Hive table stored as ORC. I want to work with a timestamp 
> column in my table using pyspark.
> For example, I try this:
>  spark.sql("select id, time_ from mydb.table1").show()
>  
>  Py4JJavaError: An error occurred while calling o2877.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
> at 
> org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148)
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:127)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152)
> at 
> org.apache.spark.s

[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32635:
--
Labels: correct  (was: )

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Blocker
>  Labels: correct
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  
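A quick way to see whether the cached plan is involved is to print the extended
plan; a small sketch, assuming the {{finaldf}} and {{df_23_a}} from the
reproduction above are still in scope:
{code:python}
# Extended plan: when df_23_a is cached, an InMemoryRelation shows up here,
# which is where the cached and uncached variants start to differ.
finaldf.explain(True)

# Un-caching (and re-running the joins) is one way to cross-check the result.
df_23_a.unpersist()
{code}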



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32635:
--
Labels: correctness  (was: correct)

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Blocker
>  Labels: correctness
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32635:
--
Priority: Blocker  (was: Major)

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Blocker
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32185) User Guide - Monitoring

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32185:


Assignee: Abhijeet Prasad

> User Guide - Monitoring
> ---
>
> Key: SPARK-32185
> URL: https://issues.apache.org/jira/browse/SPARK-32185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Abhijeet Prasad
>Priority: Major
>
> Monitoring. We should focus on how to monitor PySpark jobs.
> - Custom Worker, see also 
> https://github.com/apache/spark/tree/master/python/test_coverage to enable 
> test coverage that includes the worker side too.
> - Sentry Support \(?\) 
> https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark
> - Link back https://spark.apache.org/docs/latest/monitoring.html . 
> - ...
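For the monitoring link-back item above, a minimal sketch of probing the REST
monitoring API from Python might look like this (the host, port and a locally
running application are assumptions; the endpoint is the one documented on the
monitoring page):
{code:python}
import json
import urllib.request

# Assumes a locally running application with the Spark UI on the default port 4040.
url = "http://localhost:4040/api/v1/applications"

with urllib.request.urlopen(url) as resp:
    apps = json.load(resp)

for app in apps:
    print(app["id"], app["name"])
{code}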



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196931#comment-17196931
 ] 

Apache Spark commented on SPARK-32900:
--

User 'tomvanbussel' has created a pull request for this issue:
https://github.com/apache/spark/pull/29772

> UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in 
> the input and radix sorting is used.
> 
>
> Key: SPARK-32900
> URL: https://issues.apache.org/jira/browse/SPARK-32900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Tom van Bussel
>Priority: Major
>
> In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has 
> spilled already it checks whether {{upstream}} is an instance of 
> {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by 
> SPARK-14851) and there are NULLs in the input however, upstream will be an 
> instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should 
> still be spilled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32900:


Assignee: (was: Apache Spark)

> UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in 
> the input and radix sorting is used.
> 
>
> Key: SPARK-32900
> URL: https://issues.apache.org/jira/browse/SPARK-32900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Tom van Bussel
>Priority: Major
>
> In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has 
> spilled already it checks whether {{upstream}} is an instance of 
> {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by 
> SPARK-14851) and there are NULLs in the input however, upstream will be an 
> instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should 
> still be spilled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32900:


Assignee: Apache Spark

> UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in 
> the input and radix sorting is used.
> 
>
> Key: SPARK-32900
> URL: https://issues.apache.org/jira/browse/SPARK-32900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Tom van Bussel
>Assignee: Apache Spark
>Priority: Major
>
> In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has 
> spilled already it checks whether {{upstream}} is an instance of 
> {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by 
> SPARK-14851) and there are NULLs in the input however, upstream will be an 
> instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should 
> still be spilled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196910#comment-17196910
 ] 

Apache Spark commented on SPARK-32635:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29771

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32635:


Assignee: Apache Spark

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Assignee: Apache Spark
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32635:


Assignee: (was: Apache Spark)

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> +----+
> |col2|
> +----+
> |   1|
> +----+
> {code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> +----+----+----+
> |col2|col3|col1|
> +----+----+----+
> |   2|   9|   b|
> +----+----+----+
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached
> +----+
> |col2|
> +----+
> |   2|
> +----+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32814) Metaclasses are broken for a few classes in Python 3

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32814:


Assignee: Maciej Szymkiewicz

> Metaclasses are broken for a few classes in Python 3
> 
>
> Key: SPARK-32814
> URL: https://issues.apache.org/jira/browse/SPARK-32814
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> As of Python 3 {{__metaclass__}} is no longer supported 
> https://www.python.org/dev/peps/pep-3115/.
> However, we have multiple classes which were never migrated to Python 3 
> compatible syntax:
> - A number of ML {{Params}} with {{__metaclass__ = ABCMeta}}
> - Some of the SQL {{types}} with {{__metaclass__ = DataTypeSingleton}}
> As a result some functionality is broken in Python 3. For example 
> {code:python}
> >>> from pyspark.sql.types import BooleanType
> >>> BooleanType() is BooleanType()
> False
> {code}
> or
> {code:python}
> >>> import inspect
> >>> from pyspark.ml import Estimator
> >>> inspect.isabstract(Estimator)
> False
> {code}
> where in both cases we expect to see {{True}}.
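For reference, the Python 3 spelling attaches the metaclass in the class header
instead of via a {{__metaclass__}} attribute. A minimal stand-alone sketch with
toy stand-ins for the affected classes (not the actual pyspark code):
{code:python}
from abc import ABCMeta, abstractmethod
import inspect


class Estimator(metaclass=ABCMeta):  # Python 3 spelling; __metaclass__ is ignored
    @abstractmethod
    def _fit(self, dataset):
        ...


class DataTypeSingleton(type):
    """Toy singleton metaclass standing in for the real one in pyspark.sql.types."""
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class BooleanType(metaclass=DataTypeSingleton):
    pass


print(inspect.isabstract(Estimator))   # True once the metaclass is really applied
print(BooleanType() is BooleanType())  # True: the singleton behaviour holds again
{code}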



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32814) Metaclasses are broken for a few classes in Python 3

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32814.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29664
[https://github.com/apache/spark/pull/29664]

> Metaclasses are broken for a few classes in Python 3
> 
>
> Key: SPARK-32814
> URL: https://issues.apache.org/jira/browse/SPARK-32814
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> As of Python 3 {{__metaclass__}} is no longer supported 
> https://www.python.org/dev/peps/pep-3115/.
> However, we have multiple classes which were never migrated to Python 3 
> compatible syntax:
> - A number of ML {{Params}} with {{__metaclass__ = ABCMeta}}
> - Some of the SQL {{types}} with {{__metaclass__ = DataTypeSingleton}}
> As a result some functionality is broken in Python 3. For example 
> {code:python}
> >>> from pyspark.sql.types import BooleanType
> >>> BooleanType() is BooleanType()
> False
> {code}
> or
> {code:python}
> >>> import inspect
> >>> from pyspark.ml import Estimator
> >>> inspect.isabstract(Estimator)
> False
> {code}
> where in both cases we expect to see {{True}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32835) Add withField to PySpark Column class

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32835:


Assignee: Adam Binford

> Add withField to PySpark Column class
> -
>
> Key: SPARK-32835
> URL: https://issues.apache.org/jira/browse/SPARK-32835
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
>
> The withField functionality was added to the Scala API; we simply need a 
> Python wrapper to call it.
> Will need to do the same for dropField once it gets added back.
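For context, a sketch of how the wrapper would be used from Python, assuming
Spark 3.1.0+ where it lands; the column and field names are made up for the
example:
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A toy struct column; the nested tuple becomes a struct with fields _1 and _2.
df = spark.createDataFrame([(1, (2, 3))], ["id", "s"])

# Replace a single field inside the struct without rebuilding the whole struct.
df2 = df.withColumn("s", F.col("s").withField("_2", F.lit(99)))
df2.select("id", "s._1", "s._2").show()
{code}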



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32835) Add withField to PySpark Column class

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32835.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29699
[https://github.com/apache/spark/pull/29699]

> Add withField to PySpark Column class
> -
>
> Key: SPARK-32835
> URL: https://issues.apache.org/jira/browse/SPARK-32835
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.1.0
>
>
> The withField functionality was added to the Scala API; we simply need a 
> Python wrapper to call it.
> Will need to do the same for dropField once it gets added back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32888:


Assignee: L. C. Hsieh

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code:
>  * create a parallelized rdd like: 
> {{prdd = spark.read.text("test.csv").rdd.flatMap(lambda x : x)}}
>  * create a df like so: {{mydf = spark.read.csv(prdd, header=True)}}
>  * {{mydf.count()}} will result in a record count of zero (when it should be 1)
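Spelled out as a runnable snippet based on the steps above (the file path is just
an example and local mode is assumed):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A header row followed by an identical data row, as in the description.
with open("/tmp/test.csv", "w") as f:
    f.write("aaa,bbb\naaa,bbb\n")

prdd = spark.read.text("/tmp/test.csv").rdd.flatMap(lambda x: x)
mydf = spark.read.csv(prdd, header=True)
print(mydf.count())  # 0 on the affected versions; 1 is expected
{code}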



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32888.
--
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29765
[https://github.com/apache/spark/pull/29765]

> reading a parallized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code:
>  * create a parallelized rdd like: 
> {{prdd = spark.read.text("test.csv").rdd.flatMap(lambda x : x)}}
>  * create a df like so: {{mydf = spark.read.csv(prdd, header=True)}}
>  * {{mydf.count()}} will result in a record count of zero (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32901) UnsafeExternalSorter may cause a SparkOutOfMemoryError to be thrown while spilling

2020-09-16 Thread Tom van Bussel (Jira)
Tom van Bussel created SPARK-32901:
--

 Summary: UnsafeExternalSorter may cause a SparkOutOfMemoryError to 
be thrown while spilling
 Key: SPARK-32901
 URL: https://issues.apache.org/jira/browse/SPARK-32901
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1, 2.4.7
Reporter: Tom van Bussel


Consider the following sequence of events:
 # {{UnsafeExternalSorter}} runs out of space in its pointer array and attempts 
to allocate a large array to replace the current one.
 # {{TaskMemoryManager}} tries to allocate the memory backing the large array 
using {{MemoryManager}}, but {{MemoryManager}} is only willing to return most 
but not all of the memory requested.
 # {{TaskMemoryManager}} asks {{UnsafeExternalSorter}} to spill, which causes 
{{UnsafeExternalSorter}} to spill the current run to disk, to free its record 
pages and to reset its {{UnsafeInMemorySorter}}.
 # {{UnsafeInMemorySorter}} frees its pointer array, and tries to allocate a 
new small pointer array.
 # {{TaskMemoryManager}} tries to allocate the memory backing the small array 
using {{MemoryManager}}, but {{MemoryManager}} is unwilling to give it any 
memory, as the {{TaskMemoryManager}} is still holding on to the memory it got 
for the large array.
 # {{TaskMemoryManager}} again asks {{UnsafeExternalSorter}} to spill, but this 
time there is nothing to spill.
 # {{UnsafeInMemorySorter}} receives less memory than it requested, and causes 
a {{SparkOutOfMemoryError}} to be thrown, which causes the current task to fail.

A simple way to fix this is to avoid allocating a new array in 
{{UnsafeInMemorySorter.reset()}} and to do this on-demand instead.
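The on-demand idea, reduced to a small pattern sketch; this only illustrates the
allocation strategy, it is not Spark's actual sorter code:
{code:python}
class PointerBuffer:
    """Toy buffer that allocates its backing array lazily instead of in reset()."""

    def __init__(self, initial_size=4):
        self._initial_size = initial_size
        self._array = None   # nothing allocated until a record arrives
        self._used = 0

    def reset(self):
        # Free the old array but do NOT request a replacement here; asking the
        # memory manager for a new array at this point is what can fail when no
        # memory is left and there is nothing more to spill.
        self._array = None
        self._used = 0

    def insert(self, value):
        if self._array is None:          # allocate on demand, at first insert
            self._array = [None] * self._initial_size
        if self._used == len(self._array):
            self._array.extend([None] * len(self._array))  # double when full
        self._array[self._used] = value
        self._used += 1
{code}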



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.

2020-09-16 Thread Tom van Bussel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom van Bussel updated SPARK-32900:
---
Description: In order to determine whether 
{{UnsafeExternalSorter.SpillableIterator}} has spilled already it checks 
whether {{upstream}} is an instance of {{UnsafeInMemorySorter.SortedIterator}}. 
When radix sorting is used (added by SPARK-14851) and there are NULLs in the 
input however, upstream will be an instance of 
{{UnsafeExternalSorter.ChainedIterator}} instead, but should still be spilled.  
(was: In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} 
has spilled already it checks whether {{upstream}} is an instance of 
{{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used and there 
are NULLs in the input however, upstream will be an instance of 
{{UnsafeExternalSorter.ChainedIterator}} instead, but should still be spilled.)

> UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in 
> the input and radix sorting is used.
> 
>
> Key: SPARK-32900
> URL: https://issues.apache.org/jira/browse/SPARK-32900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Tom van Bussel
>Priority: Major
>
> In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has 
> spilled already it checks whether {{upstream}} is an instance of 
> {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by 
> SPARK-14851) and there are NULLs in the input however, upstream will be an 
> instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should 
> still be spilled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.

2020-09-16 Thread Tom van Bussel (Jira)
Tom van Bussel created SPARK-32900:
--

 Summary: UnsafeExternalSorter.SpillableIterator cannot spill when 
there are NULLs in the input and radix sorting is used.
 Key: SPARK-32900
 URL: https://issues.apache.org/jira/browse/SPARK-32900
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1, 2.4.7
Reporter: Tom van Bussel


In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has 
spilled already it checks whether {{upstream}} is an instance of 
{{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used and there 
are NULLs in the input however, upstream will be an instance of 
{{UnsafeExternalSorter.ChainedIterator}} instead, but should still be spilled.
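The fragility described above, reduced to a toy pattern sketch (illustrative
classes only, not the Spark ones): inferring "has spilled" from the concrete type
of the upstream iterator silently breaks when another iterator type shows up,
while an explicit flag does not.
{code:python}
class SortedIterator: ...
class ChainedIterator: ...   # e.g. the wrapper used when NULLs are sorted separately


class SpillableIterator:
    def __init__(self, upstream):
        self.upstream = upstream
        self.spilled = False  # explicit state instead of inferring it from a type

    def can_spill_type_check(self):
        # Infers "not yet spilled" from the upstream type; wrongly says False for
        # ChainedIterator even though nothing has actually been spilled yet.
        return isinstance(self.upstream, SortedIterator)

    def can_spill_flag(self):
        return not self.spilled


it = SpillableIterator(ChainedIterator())
print(it.can_spill_type_check(), it.can_spill_flag())  # False True
{code}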



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2020-09-16 Thread Lauri Koobas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196812#comment-17196812
 ] 

Lauri Koobas edited comment on SPARK-29900 at 9/16/20, 9:18 AM:


Bringing up a related point -
{code:java}
show tables in <database>{code}
also always shows temporary views. Same with
{code:java}
sqlContext.tableNames("database"){code}
This seems counter-intuitive.

There should be an additional flag to either include or exclude temporary views 
from that list.
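To make the point concrete, a small pyspark sketch showing a temporary view
appearing in the listing (the database name is just the default one):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1).createOrReplaceTempView("tmp_view")

# The temporary view is listed next to the tables of the database being inspected.
spark.sql("SHOW TABLES IN default").show()

# Filtering it out currently requires the isTemporary flag on the catalog API.
print([t.name for t in spark.catalog.listTables("default") if not t.isTemporary])
{code}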


was (Author: laurikoobas):
Bringing up a related point - `show tables in ` always shows also 
temporary views. Same with `sqlContext.tableNames("database")`. This seems 
counter-intuitive.

There should be an additional flag to either include or exclude temporary views 
from that list.

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. try to look up temp view first, then try table/persistent view.
> 2. try to look up table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when there are a temp view and a table 
> with the same name, but users can easily use a qualified table name to 
> disambiguate.
> In postgres, the relation resolution behavior is consistent
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2020-09-16 Thread Lauri Koobas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196812#comment-17196812
 ] 

Lauri Koobas commented on SPARK-29900:
--

Bringing up a related point - `show tables in <database>` also always shows 
temporary views. Same with `sqlContext.tableNames("database")`. This seems 
counter-intuitive.

There should be an additional flag to either include or exclude temporary views 
from that list.

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. try to look up temp view first, then try table/persistent view.
> 2. try to look up table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when there are a temp view and a table 
> with the same name, but users can easily use a qualified table name to 
> disambiguate.
> In postgres, the relation resolution behavior is consistent
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32894) Timestamp cast in exernal ocr table

2020-09-16 Thread Grigory Skvortsov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196783#comment-17196783
 ] 

Grigory Skvortsov edited comment on SPARK-32894 at 9/16/20, 8:32 AM:
-

From the Hive CLI, using the following code:

 

CREATE EXTERNAL TABLE IF NOT EXISTS testtable(
 HOST String,
 ID bigint,
 TYPE int,
 TIME_ TIMESTAMP)
PARTITIONED BY (p1 String, p2 String)
CLUSTERED BY (host) INTO 5 BUCKETS
STORED AS ORC
LOCATION '/user/hive/warehouse/testtable';
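A quick way to narrow this down from the pyspark side, assuming a {{spark}}
session is available and using the table and location from the DDL above, is to
compare the schema Spark reports with the schema read straight from the ORC files:
{code:python}
# What type does Spark think the column has, versus the DDL above?
spark.table("testtable").printSchema()

# Reading the ORC location directly bypasses the metastore schema, which helps
# tell apart "wrong data in the files" from "wrong schema in the metastore".
spark.read.orc("/user/hive/warehouse/testtable").printSchema()
{code}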


was (Author: skvortsovg):
>From hiveCli using following code:

> Timestamp cast in exernal ocr table
> ---
>
> Key: SPARK-32894
> URL: https://issues.apache.org/jira/browse/SPARK-32894
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
> Java 1.8
> Hadoop 3.3.0
> Hive 3.1.2
> Python 3.7 (from pyspark)
>Reporter: Grigory Skvortsov
>Priority: Major
>
> I have an external Hive table stored as ORC. I want to work with the timestamp 
> column in my table using pyspark.
> For example, I try this:
>  spark.sql('select id, time_ from mydb.table1').show()
>  
>  Py4JJavaError: An error occurred while calling o2877.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
> at 
> org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148)
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:127)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
> at scala.Option.f

[jira] [Commented] (SPARK-32894) Timestamp cast in exernal ocr table

2020-09-16 Thread Grigory Skvortsov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196783#comment-17196783
 ] 

Grigory Skvortsov commented on SPARK-32894:
---

From the Hive CLI, using the following code:

> Timestamp cast in exernal ocr table
> ---
>
> Key: SPARK-32894
> URL: https://issues.apache.org/jira/browse/SPARK-32894
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
> Java 1.8
> Hadoop 3.3.0
> Hive 3.1.2
> Python 3.7 (from pyspark)
>Reporter: Grigory Skvortsov
>Priority: Major
>
> I have an external Hive table stored as ORC. I want to work with the timestamp 
> column in my table using pyspark.
> For example, I try this:
>  spark.sql('select id, time_ from mydb.table1').show()
>  
>  Py4JJavaError: An error occurred while calling o2877.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
> at 
> org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148)
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730)
> at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:127)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152)
> at 
> org.apache

[jira] [Assigned] (SPARK-32899) Support submit application with user-defined cluster manager

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32899:


Assignee: (was: Apache Spark)

> Support submit application with user-defined cluster manager
> 
>
> Key: SPARK-32899
> URL: https://issues.apache.org/jira/browse/SPARK-32899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Users can already define a custom cluster manager by implementing the 
> `ExternalClusterManager` trait. However, such an application cannot be 
> submitted with `SparkSubmit`. This patch adds support for submitting 
> applications with a user-defined cluster manager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32899) Support submit application with user-defined cluster manager

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196754#comment-17196754
 ] 

Apache Spark commented on SPARK-32899:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29770

> Support submit application with user-defined cluster manager
> 
>
> Key: SPARK-32899
> URL: https://issues.apache.org/jira/browse/SPARK-32899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Users can already define a custom cluster manager by implementing the 
> `ExternalClusterManager` trait. However, such an application cannot be 
> submitted with `SparkSubmit`. This patch adds support for submitting 
> applications with a user-defined cluster manager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32899) Support submit application with user-defined cluster manager

2020-09-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196753#comment-17196753
 ] 

Apache Spark commented on SPARK-32899:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29770

> Support submit application with user-defined cluster manager
> 
>
> Key: SPARK-32899
> URL: https://issues.apache.org/jira/browse/SPARK-32899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Users can already define a custom cluster manager by implementing the 
> `ExternalClusterManager` trait. However, such an application cannot be 
> submitted with `SparkSubmit`. This patch adds support for submitting 
> applications with a user-defined cluster manager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32899) Support submit application with user-defined cluster manager

2020-09-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32899:


Assignee: Apache Spark

> Support submit application with user-defined cluster manager
> 
>
> Key: SPARK-32899
> URL: https://issues.apache.org/jira/browse/SPARK-32899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>Priority: Major
>
> Users can already define a custom cluster manager by implementing the 
> `ExternalClusterManager` trait. However, such an application cannot be 
> submitted with `SparkSubmit`. This patch adds support for submitting 
> applications with a user-defined cluster manager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32899) Support submit application with user-defined cluster manager

2020-09-16 Thread Xianyang Liu (Jira)
Xianyang Liu created SPARK-32899:


 Summary: Support submit application with user-defined cluster 
manager
 Key: SPARK-32899
 URL: https://issues.apache.org/jira/browse/SPARK-32899
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Xianyang Liu


Users can already define a custom cluster manager by implementing the 
`ExternalClusterManager` trait. However, such an application cannot be submitted 
with `SparkSubmit`. This patch adds support for submitting applications with a 
user-defined cluster manager.
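
For illustration, a minimal sketch of such a user-defined cluster manager is 
shown below. The trait and several scheduler classes are package-private to 
`org.apache.spark`, so the sketch lives under that namespace; the package, 
class name, and the `mycluster://` master-URL scheme are all hypothetical, and 
the local scheduler backend is reused only to keep the example self-contained. 
Spark discovers implementations through `java.util.ServiceLoader`, so the class 
name would also be listed in a 
`META-INF/services/org.apache.spark.scheduler.ExternalClusterManager` resource 
file.

{code:scala}
package org.apache.spark.scheduler.custom  // hypothetical package; the trait is private[spark]

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
import org.apache.spark.scheduler.local.LocalSchedulerBackend

// Hypothetical cluster manager, selected when the master URL uses the
// "mycluster://" scheme.
class MyClusterManager extends ExternalClusterManager {

  // SparkContext asks every registered manager whether it handles this master URL.
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")

  // Reuse Spark's standard task scheduler implementation.
  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  // A real manager would return a backend that talks to its own cluster; the
  // local backend is used here only to keep the sketch self-contained.
  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    new LocalSchedulerBackend(sc.getConf, scheduler.asInstanceOf[TaskSchedulerImpl], 1)

  // Wire the scheduler and backend together before the context starts.
  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}
{code}

As the description notes, a command such as 
`spark-submit --master mycluster://host:port myapp.jar` is currently rejected, 
since `SparkSubmit` only recognizes the built-in master URL schemes (local, 
spark, mesos, yarn, k8s); the proposed change would let an unrecognized scheme 
be routed to a registered `ExternalClusterManager` instead.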



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org