[jira] [Created] (SPARK-30215) Remove PrunedInMemoryFileIndex and merge its functionality into InMemoryFileIndex

2019-12-11 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-30215:
-

 Summary: Remove PrunedInMemoryFileIndex and merge its 
functionality into InMemoryFileIndex
 Key: SPARK-30215
 URL: https://issues.apache.org/jira/browse/SPARK-30215
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hu Fuwang


PrunedInMemoryFileIndex is only used in CatalogFileIndex.filterPartitions, and 
its name is somewhat confusing. We can merge its functionality into 
InMemoryFileIndex and remove the class.






[jira] [Updated] (SPARK-30214) Support COMMENT ON syntax

2019-12-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-30214:
-
Description: 
https://prestosql.io/docs/current/sql/comment.html
https://www.postgresql.org/docs/12/sql-comment.html

We are going to disable setting reserved properties directly via dbproperties or 
tblproperties, which requires a subclause in the CREATE syntax or specific 
ALTER commands.

  was:
https://prestosql.io/docs/current/sql/comment.html
https://www.postgresql.org/docs/12/sql-comment.html

We are going to disable setting reserved properties by dbproperties or 
tblproperites directory, which need a subclause in create syntax or specific 
alter commands


> Support COMMENT ON syntax
> -
>
> Key: SPARK-30214
> URL: https://issues.apache.org/jira/browse/SPARK-30214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> https://prestosql.io/docs/current/sql/comment.html
> https://www.postgresql.org/docs/12/sql-comment.html
> We are going to disable setting reserved properties directly via dbproperties or 
> tblproperties, which requires a subclause in the CREATE syntax or specific 
> ALTER commands.
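
As a point of reference, here is a hedged sketch of what the proposed syntax could look like from Spark, modeled on the PostgreSQL/Presto statements linked above; the catalog, namespace, and table names are purely illustrative and the exact grammar is still to be decided.

{code:scala}
// Illustrative only: COMMENT ON statements in the style of PostgreSQL/Presto,
// issued through spark.sql. Object names below are made up for the example.
spark.sql("CREATE NAMESPACE testcat.ns")
spark.sql("CREATE TABLE testcat.ns.events (id BIGINT, data STRING) USING parquet")

spark.sql("COMMENT ON NAMESPACE testcat.ns IS 'raw event data'")
spark.sql("COMMENT ON TABLE testcat.ns.events IS 'events ingested from Kafka'")

// Following PostgreSQL, setting the comment to NULL would remove it.
spark.sql("COMMENT ON TABLE testcat.ns.events IS NULL")
{code}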






[jira] [Created] (SPARK-30216) Use python3 in Docker release image

2019-12-11 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30216:
---

 Summary: Use python3 in Docker release image
 Key: SPARK-30216
 URL: https://issues.apache.org/jira/browse/SPARK-30216
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Yuming Wang
Assignee: Yuming Wang









[jira] [Resolved] (SPARK-30104) global temp db name can be used as a table name under v2 catalog

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30104.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26741
[https://github.com/apache/spark/pull/26741]

> global temp db name can be used as a table name under v2 catalog
> 
>
> Key: SPARK-30104
> URL: https://issues.apache.org/jira/browse/SPARK-30104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, 'global_temp' can be used in certain commands (CREATE) but not in 
> others (DESCRIBE) because the catalog lookup logic only considers the first 
> element of the multi-part name and always uses the session catalog if it is 
> set to 'global_temp'.
> For example:
> {code:java}
> // Assume "spark.sql.globalTempDatabase" is set to "global_temp".
> sql(s"CREATE TABLE testcat.t (id bigint, data string) USING foo")
> sql(s"CREATE TABLE testcat.global_temp (id bigint, data string) USING foo")
> sql("USE testcat")
> sql(s"DESCRIBE TABLE t").show
> +---+-+---+
> |   col_name|data_type|comment|
> +---+-+---+
> | id|   bigint|   |
> |   data|   string|   |
> |   | |   |
> | # Partitioning| |   |
> |Not partitioned| |   |
> +---+-+---+
> sql(s"DESCRIBE TABLE global_temp").show
> org.apache.spark.sql.AnalysisException: Table not found: global_temp;;
>   'DescribeTable 'UnresolvedV2Relation [global_temp], 
> org.apache.spark.sql.connector.InMemoryTableSessionCatalog@2f1af64f, 
> `global_temp`, false
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122)
> {code}






[jira] [Assigned] (SPARK-30104) global temp db name can be used as a table name under v2 catalog

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30104:
---

Assignee: Terry Kim

> global temp db name can be used as a table name under v2 catalog
> 
>
> Key: SPARK-30104
> URL: https://issues.apache.org/jira/browse/SPARK-30104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Currently, 'global_temp' can be used in certain commands (CREATE) but not in 
> others (DESCRIBE) because the catalog lookup logic only considers the first 
> element of the multi-part name and always uses the session catalog if it is 
> set to 'global_temp'.
> For example:
> {code:java}
> // Assume "spark.sql.globalTempDatabase" is set to "global_temp".
> sql(s"CREATE TABLE testcat.t (id bigint, data string) USING foo")
> sql(s"CREATE TABLE testcat.global_temp (id bigint, data string) USING foo")
> sql("USE testcat")
> sql(s"DESCRIBE TABLE t").show
> +---+-+---+
> |   col_name|data_type|comment|
> +---+-+---+
> | id|   bigint|   |
> |   data|   string|   |
> |   | |   |
> | # Partitioning| |   |
> |Not partitioned| |   |
> +---+-+---+
> sql(s"DESCRIBE TABLE global_temp").show
> org.apache.spark.sql.AnalysisException: Table not found: global_temp;;
>   'DescribeTable 'UnresolvedV2Relation [global_temp], 
> org.apache.spark.sql.connector.InMemoryTableSessionCatalog@2f1af64f, 
> `global_temp`, false
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122)
> {code}






[jira] [Resolved] (SPARK-27506) Function `from_avro` doesn't allow deserialization of data using other compatible schemas

2019-12-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-27506.

  Assignee: Fokko Driesprong
Resolution: Fixed

The issue is resolved by https://github.com/apache/spark/pull/26780

> Function `from_avro` doesn't allow deserialization of data using other 
> compatible schemas
> -
>
> Key: SPARK-27506
> URL: https://issues.apache.org/jira/browse/SPARK-27506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gianluca Amori
>Assignee: Fokko Driesprong
>Priority: Major
>
>  SPARK-24768 and subtasks introduced support to read and write Avro data by 
> parsing a binary column of Avro format and converting it into its 
> corresponding Catalyst value (and vice versa).
>  
> The current implementation has the limitation of requiring deserialization of 
> an event with the exact same schema with which it was serialized. This breaks 
> one of the most important features of Avro, schema evolution 
> [https://docs.confluent.io/current/schema-registry/avro.html] - most 
> importantly, the ability to read old data with a newer (compatible) schema 
> without breaking the consumer.
>  
> The GenericDatumReader in the Avro library already supports passing an 
> optional *writer's schema* (the schema with which the record was serialized) 
> alongside a mandatory *reader's schema* (the schema with which the record is 
> going to be deserialized). The proposed change is to do the same in the 
> from_avro function, allowing the possibility to pass an optional writer's 
> schema to be used in the deserialization.
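
For context, a minimal sketch of the reader/writer schema resolution mentioned above, using the plain Avro API rather than Spark; the helper name and the assumption that both schemas are available as Schema objects are illustrative.

{code:scala}
// Sketch: GenericDatumReader resolves a record written with writerSchema into the
// shape of readerSchema (schema evolution), which is what from_avro would expose.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

def decode(bytes: Array[Byte], writerSchema: Schema, readerSchema: Schema): GenericRecord = {
  val reader  = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  reader.read(null, decoder)
}
{code}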






[jira] [Updated] (SPARK-27506) Function `from_avro` doesn't allow deserialization of data using other compatible schemas

2019-12-11 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-27506:
-
Fix Version/s: 3.0.0

> Function `from_avro` doesn't allow deserialization of data using other 
> compatible schemas
> -
>
> Key: SPARK-27506
> URL: https://issues.apache.org/jira/browse/SPARK-27506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gianluca Amori
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.0.0
>
>
>  SPARK-24768 and subtasks introduced support to read and write Avro data by 
> parsing a binary column of Avro format and converting it into its 
> corresponding Catalyst value (and vice versa).
>  
> The current implementation has the limitation of requiring deserialization of 
> an event with the exact same schema with which it was serialized. This breaks 
> one of the most important features of Avro, schema evolution 
> [https://docs.confluent.io/current/schema-registry/avro.html] - most 
> importantly, the ability to read old data with a newer (compatible) schema 
> without breaking the consumer.
>  
> The GenericDatumReader in the Avro library already supports passing an 
> optional *writer's schema* (the schema with which the record was serialized) 
> alongside a mandatory *reader's schema* (the schema with which the record is 
> going to be deserialized). The proposed change is to do the same in the 
> from_avro function, allowing the possibility to pass an optional writer's 
> schema to be used in the deserialization.






[jira] [Created] (SPARK-30217) Improve partition pruning rule to enforce idempotence

2019-12-11 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-30217:
--

 Summary: Improve partition pruning rule to enforce idempotence
 Key: SPARK-30217
 URL: https://issues.apache.org/jira/browse/SPARK-30217
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


The rule was added to the blacklist in 
https://github.com/apache/spark/commit/a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-a636a87d8843eeccca90140be91d4fafR52






[jira] [Created] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2019-12-11 Thread Francesco Cavrini (Jira)
Francesco Cavrini created SPARK-30218:
-

 Summary: Columns used in inequality conditions for joins not 
resolved correctly in case of common lineage
 Key: SPARK-30218
 URL: https://issues.apache.org/jira/browse/SPARK-30218
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.1
Reporter: Francesco Cavrini


When columns from different data-frames that have a common lineage are used in 
inequality conditions in joins, they are not resolved correctly. In particular, 
both the column from the left DF and the one from the right DF are resolved to 
the same column, thus making the inequality condition either always satisfied 
or never satisfied.

Minimal example to reproduce follows.

{code:python}
import pyspark.sql.functions as F

data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
["id", "kind", "timestamp"])

df_left = data.where(F.col("kind") == "A").alias("left")
df_right = data.where(F.col("kind") == "B").alias("right")

conds = [df_left["id"] == df_right["id"]]
conds.append(df_right["timestamp"].between(df_left["timestamp"], 
df_left["timestamp"] + 2))

res = df_left.join(df_right, conds, how="left")
{code}

The result is:

| id|kind|timestamp| id|kind|timestamp|
|id1|   A|0|id1|   B|1|
|id1|   A|0|id1|   B|5|
|id1|   A|1|id1|   B|1|
|id1|   A|1|id1|   B|5|
|id2|   A|2|id2|   B|   10|
|id2|   A|3|id2|   B|   10|

which violates the condition that the timestamp from the right DF should be 
between df_left["timestamp"] and  df_left["timestamp"] + 2.

The plan shows the problem in the column resolution.

{code:bash}
== Parsed Logical Plan ==
Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
(timestamp#2L <= (timestamp#2L + cast(2 as bigint)
:- SubqueryAlias `left`
:  +- Filter (kind#1 = A)
: +- LogicalRDD [id#0, kind#1, timestamp#2L], false
+- SubqueryAlias `right`
   +- Filter (kind#37 = B)
  +- LogicalRDD [id#36, kind#37, timestamp#38L], false
{code}

Note, the columns used in the equality condition of the join have been 
correctly resolved.
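
For what it's worth, a commonly suggested workaround is to refer to the columns by their qualified alias names so that each side resolves against its own subquery alias; a sketch in Scala, assuming a DataFrame `data` equivalent to the PySpark one above (not verified on 2.4.1):

{code:scala}
// Possible workaround sketch: qualify columns via the string aliases instead of
// df_left["..."] / df_right["..."], so the analyzer binds each to its own branch.
import org.apache.spark.sql.functions.col

val dfLeft  = data.where(col("kind") === "A").alias("left")
val dfRight = data.where(col("kind") === "B").alias("right")

val joined = dfLeft.join(
  dfRight,
  col("left.id") === col("right.id") &&
    col("right.timestamp").between(col("left.timestamp"), col("left.timestamp") + 2),
  "left")
{code}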






[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2019-12-11 Thread Francesco Cavrini (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993379#comment-16993379
 ] 

Francesco Cavrini commented on SPARK-30218:
---

Potentially related issue https://issues.apache.org/jira/browse/SPARK-24780

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.1
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or never satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.






[jira] [Created] (SPARK-30219) Support Filter expression referencing the outer query

2019-12-11 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30219:
--

 Summary: Support Filter expression referencing the outer query
 Key: SPARK-30219
 URL: https://issues.apache.org/jira/browse/SPARK-30219
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng









[jira] [Updated] (SPARK-30219) Support Filter expression referencing the outer query

2019-12-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30219:
---
Description: 
Spark SQL does not support SQL with a nested aggregate like the one below:

 
{code:java}
select (select count(*) filter (where outer_c <> 0)
  from (values (1)) t0(inner_c))
from (values (2),(3)) t1(outer_c);{code}
 

Spark throws an exception as follows:
{code:java}
org.apache.spark.sql.AnalysisException
Expressions referencing the outer query are not supported outside of 
WHERE/HAVING clauses:
Aggregate [count(1) AS count(1)#xL]
+- Project [col1#x AS inner_c#x]
 +- SubqueryAlias `t0`
 +- LocalRelation [col1#x]{code}
But PostgreSQL supports this syntax.

 
{code:java}
select (select count(*) filter (where outer_c <> 0)
  from (values (1)) t0(inner_c))
from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
 count 
---
 2
(1 row){code}
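
As background, the SQL standard defines `agg(x) FILTER (WHERE cond)` as equivalent to `agg(CASE WHEN cond THEN x END)`; the sketch below only illustrates that rewrite of the query above and is not claimed to run on current Spark, since the CASE expression still contains an outer reference.

{code:scala}
// Illustrative rewrite only (still contains an outer reference, so it is expected
// to hit the same restriction): FILTER folded into a CASE inside the aggregate.
spark.sql("""
  SELECT (SELECT count(CASE WHEN outer_c <> 0 THEN 1 END)
          FROM (VALUES (1)) t0(inner_c))
  FROM (VALUES (2), (3)) t1(outer_c)
""")
{code}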
 

> Support Filter expression referencing the outer query
> -
>
> Key: SPARK-30219
> URL: https://issues.apache.org/jira/browse/SPARK-30219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support SQL with a nested aggregate like the one below:
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c);{code}
>  
> Spark throws an exception as follows:
> {code:java}
> org.apache.spark.sql.AnalysisException
> Expressions referencing the outer query are not supported outside of 
> WHERE/HAVING clauses:
> Aggregate [count(1) AS count(1)#xL]
> +- Project [col1#x AS inner_c#x]
>  +- SubqueryAlias `t0`
>  +- LocalRelation [col1#x]{code}
> But PostgreSQL supports this syntax.
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
>  count 
> ---
>  2
> (1 row){code}
>  






[jira] [Created] (SPARK-30220) Support Filter expression uses IN/EXISTS predicate sub-queries

2019-12-11 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30220:
--

 Summary: Support Filter expression uses IN/EXISTS predicate 
sub-queries
 Key: SPARK-30220
 URL: https://issues.apache.org/jira/browse/SPARK-30220
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng









[jira] [Updated] (SPARK-30220) Support Filter expression uses IN/EXISTS predicate sub-queries

2019-12-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30220:
---
Description: 
Spark SQL does not support SQL with a nested aggregate like the one below:

 
{code:java}
select sum(unique1) FILTER (WHERE
 unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1;{code}
 

Spark throws an exception as follows:

 
{code:java}
org.apache.spark.sql.AnalysisException
IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few 
commands: Aggregate [sum(cast(unique1#x as bigint)) AS sum(unique1)#xL]
: +- Project [unique1#x]
: +- Filter (unique1#x < 100)
: +- SubqueryAlias `onek`
: +- RelationV2[unique1#x, unique2#x, two#x, four#x, ten#x, twenty#x, 
hundred#x, thousand#x, twothousand#x, fivethous#x, tenthous#x, odd#x, even#x, 
stringu1#x, stringu2#x, string4#x] csv 
file:/home/xitong/code/gengjiaan/spark/sql/core/target/scala-2.12/test-classes/test-data/postgresql/onek.data
+- SubqueryAlias `tenk1`
 +- RelationV2[unique1#x, unique2#x, two#x, four#x, ten#x, twenty#x, hundred#x, 
thousand#x, twothousand#x, fivethous#x, tenthous#x, odd#x, even#x, stringu1#x, 
stringu2#x, string4#x] csv 
file:/home/xitong/code/gengjiaan/spark/sql/core/target/scala-2.12/test-classes/test-data/postgresql/tenk.data{code}
 

But PostgreSQL supports this syntax.
{code:java}
select sum(unique1) FILTER (WHERE
 unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1;
 sum 
--
 4950
(1 row){code}
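
As a side note, when the query has a single aggregate and no GROUP BY (as above), the FILTER condition can be folded into WHERE, which Spark already supports for IN sub-queries; a sketch, reusing the table names from the description:

{code:scala}
// Workaround sketch: equivalent result for this particular query, because the only
// aggregate is the filtered one, so the predicate can move to WHERE.
spark.sql("""
  SELECT sum(unique1)
  FROM tenk1
  WHERE unique1 IN (SELECT unique1 FROM onek WHERE unique1 < 100)
""")
{code}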

> Support Filter expression uses IN/EXISTS predicate sub-queries
> --
>
> Key: SPARK-30220
> URL: https://issues.apache.org/jira/browse/SPARK-30220
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support SQL with a nested aggregate like the one below:
>  
> {code:java}
> select sum(unique1) FILTER (WHERE
>  unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1;{code}
>  
> Spark throws an exception as follows:
>  
> {code:java}
> org.apache.spark.sql.AnalysisException
> IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few 
> commands: Aggregate [sum(cast(unique1#x as bigint)) AS sum(unique1)#xL]
> : +- Project [unique1#x]
> : +- Filter (unique1#x < 100)
> : +- SubqueryAlias `onek`
> : +- RelationV2[unique1#x, unique2#x, two#x, four#x, ten#x, twenty#x, 
> hundred#x, thousand#x, twothousand#x, fivethous#x, tenthous#x, odd#x, even#x, 
> stringu1#x, stringu2#x, string4#x] csv 
> file:/home/xitong/code/gengjiaan/spark/sql/core/target/scala-2.12/test-classes/test-data/postgresql/onek.data
> +- SubqueryAlias `tenk1`
>  +- RelationV2[unique1#x, unique2#x, two#x, four#x, ten#x, twenty#x, 
> hundred#x, thousand#x, twothousand#x, fivethous#x, tenthous#x, odd#x, even#x, 
> stringu1#x, stringu2#x, string4#x] csv 
> file:/home/xitong/code/gengjiaan/spark/sql/core/target/scala-2.12/test-classes/test-data/postgresql/tenk.data{code}
>  
> But PostgreSQL supports this syntax.
> {code:java}
> select sum(unique1) FILTER (WHERE
>  unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1;
>  sum 
> --
>  4950
> (1 row){code}






[jira] [Updated] (SPARK-30219) Support Filter expression reference the outer query

2019-12-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30219:
---
Summary: Support Filter expression reference the outer query  (was: Support 
Filter expression referencing the outer query)

> Support Filter expression reference the outer query
> ---
>
> Key: SPARK-30219
> URL: https://issues.apache.org/jira/browse/SPARK-30219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support SQL with a nested aggregate like the one below:
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c);{code}
>  
> Spark throws an exception as follows:
> {code:java}
> org.apache.spark.sql.AnalysisException
> Expressions referencing the outer query are not supported outside of 
> WHERE/HAVING clauses:
> Aggregate [count(1) AS count(1)#xL]
> +- Project [col1#x AS inner_c#x]
>  +- SubqueryAlias `t0`
>  +- LocalRelation [col1#x]{code}
> But PostgreSQL supports this syntax.
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
>  count 
> ---
>  2
> (1 row){code}
>  






[jira] [Commented] (SPARK-27217) Nested schema pruning doesn't work for aggregation e.g. `sum`.

2019-12-11 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993460#comment-16993460
 ] 

Aman Omer commented on SPARK-27217:
---

I am able to reproduce this bug on spark-shell. Working on this

> Nested schema pruning doesn't work for aggregation e.g. `sum`.
> --
>
> Key: SPARK-27217
> URL: https://issues.apache.org/jira/browse/SPARK-27217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: colin fang
>Priority: Major
>
> Since SPARK-4502 is fixed, I would expect queries such as `select sum(b.x)` 
> not to have to read the other nested fields.
> {code:python}
> rdd = spark.range(1000).rdd.map(lambda x: [x.id+3, [x.id+1, x.id-1]])
> df = spark.createDataFrame(rdd, schema='a:int,b:struct<x:int,y:int>')
> df.repartition(1).write.mode('overwrite').parquet('test.parquet')
> df = spark.read.parquet('test.parquet')
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'false')
> df.select('b.x').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
> df.selectExpr('sum(b.x)').explain()
> # ReadSchema: struct<b:struct<x:int,y:int>>
> {code}






[jira] [Created] (SPARK-30221) Enhanced implementation of PrometheusPushGateWaySink

2019-12-11 Thread Forward Xu (Jira)
Forward Xu created SPARK-30221:
--

 Summary: Enhanced implementation of PrometheusPushGateWaySink
 Key: SPARK-30221
 URL: https://issues.apache.org/jira/browse/SPARK-30221
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Forward Xu


Enhanced implementation of PrometheusPushGateWaySink.






[jira] [Assigned] (SPARK-30207) Enhance the SQL NULL Semantics document

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30207:
---

Assignee: Yuanjian Li

> Enhance the SQL NULL Semantics document
> ---
>
> Key: SPARK-30207
> URL: https://issues.apache.org/jira/browse/SPARK-30207
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> Enhancement of the SQL NULL Semantics document: sql-ref-null-semantics.html.
> Clarify the behavior of `UNKNOWN` for both the `EXISTS` and `IN` operators.
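
For a quick illustration of the three-valued logic being documented: an IN predicate over a list containing NULL evaluates to UNKNOWN (rendered as null) rather than false when no match is found. A minimal example, with the expected output sketched:

{code:scala}
// 1 matches neither 2 nor NULL, so the predicate is UNKNOWN, not false.
spark.sql("SELECT 1 IN (2, NULL) AS result").show()
// expected output (sketch):
// +------+
// |result|
// +------+
// |  null|
// +------+
{code}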






[jira] [Resolved] (SPARK-30207) Enhance the SQL NULL Semantics document

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30207.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26837
[https://github.com/apache/spark/pull/26837]

> Enhance the SQL NULL Semantics document
> ---
>
> Key: SPARK-30207
> URL: https://issues.apache.org/jira/browse/SPARK-30207
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> Enhancement of the SQL NULL Semantics document: sql-ref-null-semantics.html.
> Clarify the behavior of `UNKNOWN` for both the `EXISTS` and `IN` operators.






[jira] [Created] (SPARK-30222) Still getting KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKe

2019-12-11 Thread Shyam (Jira)
Shyam created SPARK-30222:
-

 Summary: Still getting KafkaConsumer cache hitting max capacity of 
64, removing consumer for CacheKe
 Key: SPARK-30222
 URL: https://issues.apache.org/jira/browse/SPARK-30222
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.1
 Environment: {{Below are the logs.}}

2019-12-11 08:33:31,504 [Executor task launch worker for task 1050] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-21)
2019-12-11 08:33:32,493 [Executor task launch worker for task 1051] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-5)
2019-12-11 08:33:32,570 [Executor task launch worker for task 1052] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-9)
2019-12-11 08:33:33,441 [Executor task launch worker for task 1053] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-17)
2019-12-11 08:33:33,619 [Executor task launch worker for task 1054] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-2)
2019-12-11 08:33:34,474 [Executor task launch worker for task 1055] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-10)
2019-12-11 08:33:35,006 [Executor task launch worker for task 1056] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-6)
2019-12-11 08:33:36,326 [Executor task launch worker for task 1057] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-14)
2019-12-11 08:33:36,634 [Executor task launch worker for task 1058] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-0)
2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN 
org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
max capacity of 64, removing consumer for 
CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-19)
2019-12-11 08:33:39,183 [stream execution thread for [id = 
b3aec196-e4f2-4ef9-973b-d5685eba917e, runId = 
5c35a63a-16ad-4899-b732-1019397770bd]] WARN 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor - Current batch 
is falling behind. The trigger interval is 15000 milliseconds, but spent 63438 
milliseconds
Reporter: Shyam
 Fix For: 2.4.1


I am using spark-sql 2.4.1 with the Kafka 0.10 integration.

While I try to consume data, the consumer still gives the warning shown in the logs 
even after setting

.option("spark.sql.kafkaConsumerCache.capacity", 128)

 

{code:java}
Dataset<Row> df = sparkSession
       .readStream()
       .format("kafka")
       .option("kafka.bootstrap.servers", SERVERS)
       .option("subscribe", TOPIC)
       .option("spark.sql.kafkaConsumerCache.capacity", 128)
       .load();
{code}
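
For reference, the cache capacity is a Spark configuration rather than a Kafka source read option, so passing it through `.option(...)` on readStream is not expected to take effect. A sketch of one way to set it (value illustrative), assuming the conf is supplied when the session and its SparkContext are created, or equivalently via `--conf spark.sql.kafkaConsumerCache.capacity=128` on spark-submit:

{code:scala}
// Sketch: set the consumer cache capacity as a Spark conf, not a readStream option.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-structured-streaming")
  .config("spark.sql.kafkaConsumerCache.capacity", "128")
  .getOrCreate()
{code}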






[jira] [Commented] (SPARK-25466) Documentation does not specify how to set Kafka consumer cache capacity for SS

2019-12-11 Thread Shyam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993557#comment-16993557
 ] 

Shyam commented on SPARK-25466:
---

[~gsomogyi] I raised a ticket: https://issues.apache.org/jira/browse/SPARK-30222. 
Please let me know if you are looking for anything specific in order to suggest a fix.

> Documentation does not specify how to set Kafka consumer cache capacity for SS
> --
>
> Key: SPARK-25466
> URL: https://issues.apache.org/jira/browse/SPARK-25466
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Patrick McGloin
>Priority: Minor
>
> When hitting this warning with SS:
> 19-09-2018 12:05:27 WARN  CachedKafkaConsumer:66 - KafkaConsumer cache 
> hitting max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30)
> If you Google you get to this page:
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> Which is for Spark Streaming and says to use this config item to adjust the 
> capacity: "spark.streaming.kafka.consumer.cache.maxCapacity".
> This is a bit confusing as SS uses a different config item: 
> "spark.sql.kafkaConsumerCache.capacity"
> Perhaps the SS Kafka documentation should talk about the consumer cache 
> capacity?  Perhaps here?
> https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
> Or perhaps the warning message should reference the config item.  E.g
> 19-09-2018 12:05:27 WARN  CachedKafkaConsumer:66 - KafkaConsumer cache 
> hitting max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30).
>   *The cache size can be adjusted with the setting 
> "spark.sql.kafkaConsumerCache.capacity".*






[jira] [Assigned] (SPARK-29460) Improve tooltip for Job Tab

2019-12-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29460:


Assignee: pavithra ramachandran

> Improve tooltip for Job Tab
> ---
>
> Key: SPARK-29460
> URL: https://issues.apache.org/jira/browse/SPARK-29460
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Major
>
> [~LI,Xiao] I see there is an inconsistency in the tooltips added for columns 
> across tabs; for example, the Duration column in the Job tab does not have a 
> tooltip, but the Duration column in the JDBC/ODBC Server tab does. 
> I submitted this Jira to handle this inconsistency in the Job tab table columns.






[jira] [Updated] (SPARK-29460) Improve tooltip for Job Tab

2019-12-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29460:
-
Priority: Minor  (was: Major)

> Improve tooltip for Job Tab
> ---
>
> Key: SPARK-29460
> URL: https://issues.apache.org/jira/browse/SPARK-29460
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Minor
> Fix For: 3.0.0
>
>
> [~LI,Xiao] I see there is an inconsistency in the tooltips added for columns 
> across tabs; for example, the Duration column in the Job tab does not have a 
> tooltip, but the Duration column in the JDBC/ODBC Server tab does. 
> I submitted this Jira to handle this inconsistency in the Job tab table columns.






[jira] [Resolved] (SPARK-29460) Improve tooltip for Job Tab

2019-12-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29460.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26384
[https://github.com/apache/spark/pull/26384]

> Improve tooltip for Job Tab
> ---
>
> Key: SPARK-29460
> URL: https://issues.apache.org/jira/browse/SPARK-29460
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Major
> Fix For: 3.0.0
>
>
> [~LI,Xiao] I see there is an inconsistency in the tooltips added for columns 
> across tabs; for example, the Duration column in the Job tab does not have a 
> tooltip, but the Duration column in the JDBC/ODBC Server tab does. 
> I submitted this Jira to handle this inconsistency in the Job tab table columns.






[jira] [Resolved] (SPARK-29457) Improve tooltip information for Environment Tab

2019-12-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29457.
--
Resolution: Not A Problem

The two columns are "key" and "value". I can't imagine this needs 
clarification, so I'm closing this.

> Improve tooltip information for Environment Tab
> ---
>
> Key: SPARK-29457
> URL: https://issues.apache.org/jira/browse/SPARK-29457
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>







[jira] [Commented] (SPARK-29455) Improve tooltip information for Stages Tab

2019-12-11 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993643#comment-16993643
 ] 

Sean R. Owen commented on SPARK-29455:
--

[~pavithraramachandran] this is the last tooltip JIRA left. Do you want to take 
a look?

> Improve tooltip information for Stages Tab
> --
>
> Key: SPARK-29455
> URL: https://issues.apache.org/jira/browse/SPARK-29455
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>







[jira] [Updated] (SPARK-29455) Improve tooltip information for Stages Tab

2019-12-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29455:
-
Priority: Minor  (was: Major)

> Improve tooltip information for Stages Tab
> --
>
> Key: SPARK-29455
> URL: https://issues.apache.org/jira/browse/SPARK-29455
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>







[jira] [Commented] (SPARK-30223) queries in thrift server may read wrong SQL configs

2019-12-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993645#comment-16993645
 ] 

Wenchen Fan commented on SPARK-30223:
-

cc [~yumwang]

> queries in thrift server may read wrong SQL configs
> ---
>
> Key: SPARK-30223
> URL: https://issues.apache.org/jira/browse/SPARK-30223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The Spark thrift server creates many SparkSessions to serve requests, and the 
> thrift server serves requests using a single thread. One thread can only have 
> one active SparkSession, so SQLConf.get can't get the proper conf from the 
> session that runs the query.
> Whenever we issue an action on a SparkSession, we should set this session as 
> active session, e.g. `SparkSession.sql`.






[jira] [Created] (SPARK-30223) queries in thrift server may read wrong SQL configs

2019-12-11 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30223:
---

 Summary: queries in thrift server may read wrong SQL configs
 Key: SPARK-30223
 URL: https://issues.apache.org/jira/browse/SPARK-30223
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan


The Spark thrift server creates many SparkSessions to serve requests, and the 
thrift server serves requests using a single thread. One thread can only have 
one active SparkSession, so SQLConf.get can't get the proper conf from the 
session that runs the query.

Whenever we issue an action on a SparkSession, we should set this session as 
active session, e.g. `SparkSession.sql`.
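
A minimal sketch of the direction described above, using the public SparkSession.setActiveSession API; the wrapper name is illustrative and not the actual change:

{code:scala}
// Sketch: install the session as the active one before running its query, so that
// SQLConf.get resolves against this session's configuration.
import org.apache.spark.sql.{DataFrame, SparkSession}

def runWithSession(session: SparkSession, query: String): DataFrame = {
  SparkSession.setActiveSession(session)
  session.sql(query)
}
{code}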






[jira] [Created] (SPARK-30224) Configurable field and record separators for filters

2019-12-11 Thread thierry accart (Jira)
thierry accart created SPARK-30224:
--

 Summary: Configurable field and record separators for filters
 Key: SPARK-30224
 URL: https://issues.apache.org/jira/browse/SPARK-30224
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.4
Reporter: thierry accart


The addFilter method of JettyUtils parses sparkConf using strict field and 
record separators: at this time, it is not possible to have = or , signs in a 
parameter value. This makes it impossible to configure, for example, Hadoop's 
LDAP authentication filter, as some of its parameters contain commas and equals 
signs (see the sketch after the proposed options below).

To fix this problem, there are two solutions:

either we allow spark.<filter>.params.<name>=<value> entries in sparkConf, 
or 
we allow spark.<filter>.params.rs=<record separator> and 
spark.<filter>.params.fs=<field separator>
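
To make the limitation concrete, a small sketch of the kind of strict splitting described above (not the actual JettyUtils code); an LDAP-style value containing ',' and '=' is corrupted by it:

{code:scala}
// Sketch only: split on ',' then '=' as the description says the filter params are
// parsed. The basedn value is torn apart and mostly lost.
val params = "type=ldap,ldap.basedn=ou=people,dc=example,dc=org"
val parsed = params.split(',').map(_.split('=')).collect {
  case Array(k, v) => k -> v
}.toMap
// parsed == Map("type" -> "ldap", "dc" -> "org") -- the basedn entry is dropped and
// the trailing "dc=..." fragments collide, so the original value cannot be recovered.
{code}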






[jira] [Resolved] (SPARK-29864) Strict parsing of day-time strings to intervals

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29864.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26473
[https://github.com/apache/spark/pull/26473]

> Strict parsing of day-time strings to intervals
> ---
>
> Key: SPARK-29864
> URL: https://issues.apache.org/jira/browse/SPARK-29864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the IntervalUtils.fromDayTimeString() method does not take into 
> account the left bound `from` and truncates the result using the right bound 
> `to`. The method should respect the bounds specified by the user.
> Oracle and MySQL respect the user's bounds, see 
> https://github.com/apache/spark/pull/26358#issuecomment-551942719 and 
> https://github.com/apache/spark/pull/26358#issuecomment-549272475 
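
To illustrate the intended behaviour (see also SPARK-29920 below, resolved by the same PR): the string should be interpreted according to the user-specified bounds rather than the fixed 'd h:m:s.n' pattern. A sketch of the expected post-fix behaviour:

{code:scala}
// Expected behaviour sketch after the fix: the bounds DAY TO HOUR drive the parsing.
spark.sql("SELECT INTERVAL '20 15' DAY TO HOUR").show()
// expected (sketch): an interval of 20 days 15 hours
{code}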






[jira] [Resolved] (SPARK-29920) Parsing failure on interval '20 15' day to hour

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29920.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26473
[https://github.com/apache/spark/pull/26473]

> Parsing failure on interval '20 15' day to hour
> ---
>
> Key: SPARK-29920
> URL: https://issues.apache.org/jira/browse/SPARK-29920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> select interval '20 15' day to hour;
> Error in query:
> requirement failed: Interval string must match day-time format of 'd 
> h:m:s.n': 20 15(line 1, pos 16)
> == SQL ==
> select interval '20 15' day to hour
> ^^^
> {code}






[jira] [Assigned] (SPARK-29920) Parsing failure on interval '20 15' day to hour

2019-12-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29920:
---

Assignee: Maxim Gekk

> Parsing failure on interval '20 15' day to hour
> ---
>
> Key: SPARK-29920
> URL: https://issues.apache.org/jira/browse/SPARK-29920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> {code:sql}
> spark-sql> select interval '20 15' day to hour;
> Error in query:
> requirement failed: Interval string must match day-time format of 'd 
> h:m:s.n': 20 15(line 1, pos 16)
> == SQL ==
> select interval '20 15' day to hour
> ^^^
> {code}






[jira] [Resolved] (SPARK-29371) Support interval field values with fractional parts

2019-12-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29371.
--
Resolution: Won't Fix

> Support interval field values with fractional parts
> ---
>
> Key: SPARK-29371
> URL: https://issues.apache.org/jira/browse/SPARK-29371
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> In PostgreSQL, field values can have fractional parts; for example '1.5 week' 
> or '01:02:03.45'. Such input is converted to the appropriate number of 
> months, days, and seconds for storage. When this would result in a fractional 
> number of months or days, the fraction is added to the lower-order fields 
> using the conversion factors 1 month = 30 days and 1 day = 24 hours. For 
> example, '1.5 month' becomes 1 month and 15 days. Only seconds will ever be 
> shown as fractional on output. See 
> https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
> For example:
> {code}
> maxim=# SELECT INTERVAL '1.5 months' AS "One month 15 days";
>  One month 15 days 
> ---
>  1 mon 15 days
> (1 row)
> {code}






[jira] [Commented] (SPARK-30127) UDF should work for case class like Dataset operations

2019-12-11 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993750#comment-16993750
 ] 

Simeon Simeonov commented on SPARK-30127:
-

The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumns[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{colName}} above to 
{{colName1}}) to add versions for 2 and 3 columns, which would cover 99+% of 
all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes use from Java more convenient, without the need for {{TypeTag}}, 
while making it consistent with {{map}}/{{flatMap}} behavior (via 
{{MapFunction}}/{{FlatmapFunction}}). Given Java's popularity, this is a big 
plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.
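
A hypothetical usage sketch of the flatMapColumns proposal above (the method does not exist in Spark today; the DataFrame, case class, and column names are made up for illustration):

{code:scala}
// Hypothetical API usage: transform only the struct column "location", decoded as a
// case class, without knowing or re-encoding the rest of the row's schema.
case class Point(x: Double, y: Double)

// df is assumed to have a struct column "location" with fields x and y.
val norms = df.flatMapColumns[Point, Double]("location") { p =>
  Iterator(math.sqrt(p.x * p.x + p.y * p.y))
}
{code}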

> UDF should work for case class like Dataset operations
> --
>
> Key: SPARK-30127
> URL: https://issues.apache.org/jira/browse/SPARK-30127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark UDF can only work on data types like java.lang.String, 
> o.a.s.sql.Row, Seq[_], etc. This is inconvenient if you want to apply an 
> operation on one column, and the column is struct type. You must access data 
> from a Row object, instead of your domain object like Dataset operations. It 
> will be great if UDF can work on types that are supported by Dataset, e.g. 
> case classes.
> Note that, there are multiple ways to register a UDF, and it's only possible 
> to support this feature if the UDF is registered using Scala API that 
> provides type tag, e.g. `def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, 
> RT])`






[jira] [Comment Edited] (SPARK-30127) UDF should work for case class like Dataset operations

2019-12-11 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993750#comment-16993750
 ] 

Simeon Simeonov edited comment on SPARK-30127 at 12/11/19 5:42 PM:
---

The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumn[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{flatMapColumn}} to 
{{flatMapColumns}} and {{colName}} to {{colName1}} above) to add versions for 2 
and 3 columns, which would cover 99+% of all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes usage from Java more convenient, without the need for {{TypeTag}}, while keeping it consistent with {{map}}/{{flatMap}} behavior (via {{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.


was (Author: simeons):
The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumns[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{colName}} above to 
{{colName1}}) to add versions for 2 and 3 columns, which would cover 99+% of 
all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes usage from Java more convenient, without the need for {{TypeTag}}, while keeping it consistent with {{map}}/{{flatMap}} behavior (via {{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.

[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti

2019-12-11 Thread Mairtin Deady (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993759#comment-16993759
 ] 

Mairtin Deady commented on SPARK-25250:
---

[~cloud_fan] Has another ticket been opened for this? If so, can you share the 
link, please? I am interested to know the plans for releasing a fix for this bug.

> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> We recently had a scenario where a race condition occurred when a task from 
> a previous stage attempt finished just before a new attempt for the same 
> stage was created due to a fetch failure. As a result, the new task created 
> in the second attempt for the same partition id kept retrying due to a 
> TaskCommitDenied exception, without realizing that the task in the earlier 
> attempt had already succeeded.
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. Within 
> this window, the above task completes successfully, thus marking partition id 
> 9000 as complete for 4.0. However, as stage 4.1 has not yet been created, the 
> taskset info for that stage is not available to the TaskScheduler, so, 
> naturally, partition id 9000 has not been marked completed for 4.1. Stage 4.1 
> now spawns a task with index 2000 on the same partition id 9000. This task 
> fails due to CommitDeniedException and, since it does not see the 
> corresponding partition id marked as successful, it keeps retrying until the 
> job finally succeeds. It doesn't cause any job failures because the DAG 
> scheduler tracks the partitions separately from the task set managers.
>  
> Steps to Reproduce:
>  # Run any large job involving shuffle operation.
>  # When the ShuffleMap stage finishes and the ResultStage begins running, 
> cause this stage to throw a fetch failure exception (try deleting certain 
> shuffle files on any host).
>  # Observe the task attempt numbers for the next stage attempt. Please note 
> that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30127) UDF should work for case class like Dataset operations

2019-12-11 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993750#comment-16993750
 ] 

Simeon Simeonov edited comment on SPARK-30127 at 12/11/19 6:01 PM:
---

The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumn[C: Encoder, U: Encoder](colName: String, resultColName: 
String)
  (func: C => TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{flatMapColumn}} to 
{{flatMapColumns}} and {{colName}} to {{colName1}} above) to add versions for 2 
and 3 columns, which would cover 99+% of all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String, 
resultColName: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String, resultColName: String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes usage from Java more convenient, without the need for {{TypeTag}}, while keeping it consistent with {{map}}/{{flatMap}} behavior (via {{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.
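For comparison, a sketch of the closest approximation available today (the {{df}}, {{text}}, {{lang}} and {{Token}} names are made up): project just the needed columns into a tuple-typed Dataset and flatMap over it. The proposed {{flatMapColumns}} would collapse this into a single call and also name the result column.
{code:scala}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Token(lang: String, token: String)

val spark = SparkSession.builder().master("local[*]").appName("partial-schema-sketch").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("hello spark world", "en"), ("bonjour le monde", "fr")).toDF("text", "lang")

// Today: select only the two columns we care about, encode them as a tuple,
// then flatMap into the domain type.
val tokens: Dataset[Token] =
  df.select($"text".as("_1"), $"lang".as("_2"))
    .as[(String, String)]
    .flatMap { case (text, lang) => text.split("\\s+").map(t => Token(lang, t)).toSeq }

tokens.show()
{code}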


was (Author: simeons):
The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumn[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{flatMapColumn}} to 
{{flatMapColumns}} and {{colName}} to {{colName1}} above) to add versions for 2 
and 3 columns, which would cover 99+% of all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes usage from Java more convenient, without the need for {{TypeTag}}, while keeping it consistent with {{map}}/{{flatMap}} behavior (via {{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.

[jira] [Created] (SPARK-30225) "stream is corrup" exception on reading of shuffle disk-spilled data

2019-12-11 Thread Mala Chikka Kempanna (Jira)
Mala Chikka Kempanna created SPARK-30225:


 Summary: "stream is corrup" exception on reading of shuffle 
disk-spilled data
 Key: SPARK-30225
 URL: https://issues.apache.org/jira/browse/SPARK-30225
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.0
Reporter: Mala Chikka Kempanna


There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
--conf spark.unsafe.sorter.spill.read.ahead.enabled=false
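For illustration (an assumed application setup, not part of this report), the same flag can be set when the session is created, since it needs to be in the SparkConf before executors start:
{code:scala}
import org.apache.spark.sql.SparkSession

// Disable readahead of unsafe spill reads as a workaround for this issue.
val spark = SparkSession.builder()
  .appName("readahead-workaround")  // hypothetical app name
  .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
  .getOrCreate()
{code}
The failure itself shows up in the executor log as below: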



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  time so far)
19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  time so far)
19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (2  times so far)
19/12/10 01:53:53 ERROR executor.Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.io.IOException: Stream is corrupted
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
 at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
 at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
19/12/10 01:53:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 33
19/12/10 01:53:53 INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)
19/12/10 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk (0  time so far)
19/12/10 01:54:30 INFO executor.Executor: Executor is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled
19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled
19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown



 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30225) "stream is corrup" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Summary: "stream is corrup" exception on reading disk-spilled data of a 
shuffle operation  (was: "stream is corrup" exception on reading of shuffle 
disk-spilled data)

> "stream is corrup" exception on reading disk-spilled data of a shuffle 
> operation
> 
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
> The workaround for this problem is to disable readahead of unsafe spills with 
> the following:
> --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
> 
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
> spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
> executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
> executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
> INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 
> 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 
> GB to disk (0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor 
> is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage 
> cancelled19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 
> in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:52 INFO 
> executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30225) "stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Summary: "stream is corrupt" exception on reading disk-spilled data of a 
shuffle operation  (was: "stream is corrup" exception on reading disk-spilled 
data of a shuffle operation)

> "stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
> The workaround for this problem is to disable readahead of unsafe spills with 
> the following:
> --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
> 
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
> spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
> executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
> executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
> INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 
> 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 
> GB to disk (0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor 
> is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage 
> cancelled19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 
> in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:52 INFO 
> executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Summary: "Stream is corrupt" exception on reading disk-spilled data of a 
shuffle operation  (was: "stream is corrupt" exception on reading disk-spilled 
data of a shuffle operation)

> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
> The workaround for this problem is to disable readahead of unsafe spills with 
> the following:
> --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
> 
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
> spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
> executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
> executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
> INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 
> 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 
> GB to disk (0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor 
> is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage 
> cancelled19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 
> in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:52 INFO 
> executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Description: 
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown



 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

 

  was:
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
--conf spark.unsafe.sorter.spill.read.ahead.enabled=false



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown



 

 


> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366

[jira] [Updated] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Description: 
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false

 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown



 

 

 

  was:
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown



 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

 


> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Is

[jira] [Updated] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Description: 
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false

 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown

 



 

 

 

  was:
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false

 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown



 

 

 


> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-302

[jira] [Commented] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993827#comment-16993827
 ] 

Mala Chikka Kempanna commented on SPARK-30225:
--

Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown

 



> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
>  
> The workaround for this problem is to disable readahead of unsafe spills with 
> the following:
>  --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
>  
> This issue can be reproduced on Spark 2.4.0 by following the steps in this 
> comment of Jira SPARK-18105.
> https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461
>  
> Exception looks like below: 
>  
> 
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
> spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
> executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
> executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
> INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 
> 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 
> GB to disk (0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor 
> is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage 
> cancelled19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 
> in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:52 INFO 
> executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
>  
> 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993827#comment-16993827
 ] 

Mala Chikka Kempanna edited comment on SPARK-30225 at 12/11/19 7:02 PM:


Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown

 


was (Author: mkempanna):
Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown

 



> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
>  
> The workaround for this problem is to disable readahead of unsafe spills with 
> the following:
>  --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
>  
> This issue can be reproduced on Spark 2.4.0 by following the steps in this 
> comment of Jira SPARK-18105.
> https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461
>  
> Exception looks like below: 
>  
> 
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time 

[jira] [Issue Comment Deleted] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Comment: was deleted

(was: Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown

 )

> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
>  
> The workaround for this problem is to disable readahead of unsafe spills with 
> the following:
>  --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
>  
> This issue can be reproduced on Spark 2.4.0 by following the steps in this 
> comment of Jira SPARK-18105.
> https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461
>  
> Exception looks like below: 
> {code:java}
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
> spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
> executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
> executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
> INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 
> 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 
> GB to disk (0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor 
> is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage 
> cancelled19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 
> in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:52 INFO 
> executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30225) "Stream is corrupt" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Description: 
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false

 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

Exception looks like below: 
{code:java}
19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  time so far)
19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  time so far)
19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (2  times so far)
19/12/10 01:53:53 ERROR executor.Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.io.IOException: Stream is corrupted
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
 at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
 at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
19/12/10 01:53:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 33
19/12/10 01:53:53 INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)
19/12/10 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk (0  time so far)
19/12/10 01:54:30 INFO executor.Executor: Executor is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled
19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled
19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
{code}
 

 

 

 

 

  was:
There is an issue with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
2.4.0, introduced by https://issues.apache.org/jira/browse/SPARK-23366

The workaround for this problem is to disable readahead of unsafe spills with 
the following:
 --conf spark.unsafe.sorter.spill.read.ahead.enabled=false

 

This issue can be reproduced on Spark 2.4.0 by following the steps in this 
comment of Jira SPARK-18105.

https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461

 

Exception looks like below: 

 



19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk (1  
time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
6)java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 01:54:00 
INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 GB to disk 
(0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor is trying to 
kill task 8.1 in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:30 
INFO executor.Executor: Executor killed task 8.1 in stage 0.0 (TID 33), reason: 
Stage cancelled19/12/10 01:54:52 INFO executor.CoarseGrainedExecutorBackend: 
Driver commanded a shutdown

 



 

 

 


> "Stream is corrupt" exception on reading disk-spilled data of a shuffle 
> operation
> -
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/b

[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-12-11 Thread Mala Chikka Kempanna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993834#comment-16993834
 ] 

Mala Chikka Kempanna commented on SPARK-18105:
--

If you are facing this in Spark 2.4.0, then as a temporary measure disable 
unsafe spill readahead by setting
{code:java}
  --conf spark.unsafe.sorter.spill.read.ahead.enabled=false{code}
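For illustration, a minimal sketch of applying the same workaround programmatically instead of on the spark-submit command line; everything except the configuration key quoted above is an assumption (a standalone driver program with a plain SparkSession):
{code:scala}
// Minimal sketch: apply the workaround from this comment programmatically.
// Everything except the configuration key itself is illustrative.
import org.apache.spark.sql.SparkSession

object ReadAheadWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("readahead-workaround")
      // Disable the read-ahead wrapper around spill files (Spark 2.4.0 workaround
      // for the "Stream is corrupted" failures described in this ticket).
      .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
      .getOrCreate()

    // ... run the shuffle-heavy job as usual ...

    spark.stop()
  }
}
{code}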
 

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30225) "Stream is corrupted at" exception on reading disk-spilled data of a shuffle operation

2019-12-11 Thread Mala Chikka Kempanna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mala Chikka Kempanna updated SPARK-30225:
-
Summary: "Stream is corrupted at" exception on reading disk-spilled data of 
a shuffle operation  (was: "Stream is corrupt" exception on reading 
disk-spilled data of a shuffle operation)

> "Stream is corrupted at" exception on reading disk-spilled data of a shuffle 
> operation
> --
>
> Key: SPARK-30225
> URL: https://issues.apache.org/jira/browse/SPARK-30225
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Mala Chikka Kempanna
>Priority: Major
>
> There are issues with spark.unsafe.sorter.spill.read.ahead.enabled in Spark 
> 2.4.0, which was introduced by 
> https://issues.apache.org/jira/browse/SPARK-23366
>  
> The workaround for this problem is to disable readahead of unsafe spill with 
> the following:
>  --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
>  
> This issue can be reproduced on Spark 2.4.0 by following the steps in this 
> comment of Jira SPARK-18105.
> https://issues.apache.org/jira/browse/SPARK-18105?focusedCommentId=16981461&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16981461
>  
> Exception looks like below: 
> {code:java}
> 19/12/10 01:51:31 INFO sort.ShuffleExternalSorter: Thread 142 spilling sort 
> data of 5.1 GB to disk (1  time so far)19/12/10 01:51:31 INFO 
> sort.ShuffleExternalSorter: Thread 142 spilling sort data of 5.1 GB to disk 
> (1  time so far)19/12/10 01:52:48 INFO sort.ShuffleExternalSorter: Thread 142 
> spilling sort data of 5.1 GB to disk (2  times so far)19/12/10 01:53:53 ERROR 
> executor.Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)19/12/10 01:53:53 INFO 
> executor.CoarseGrainedExecutorBackend: Got assigned task 3319/12/10 01:53:53 
> INFO executor.Executor: Running task 8.1 in stage 0.0 (TID 33)19/12/10 
> 01:54:00 INFO sort.UnsafeExternalSorter: Thread 142 spilling sort data of 3.3 
> GB to disk (0  time so far)19/12/10 01:54:30 INFO executor.Executor: Executor 
> is trying to kill task 8.1 in stage 0.0 (TID 33), reason: Stage 
> cancelled19/12/10 01:54:30 INFO executor.Executor: Executor killed task 8.1 
> in stage 0.0 (TID 33), reason: Stage cancelled19/12/10 01:54:52 INFO 
> executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30195) Some imports, function need more explicit resolution in 2.13

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30195.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26826
[https://github.com/apache/spark/pull/26826]

> Some imports, function need more explicit resolution in 2.13
> 
>
> Key: SPARK-30195
> URL: https://issues.apache.org/jira/browse/SPARK-30195
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> This is a grouping of related but not identical issues in the 2.13 migration, 
> where the compiler is more picky about explicit types and imports. I'm 
> grouping them as they seem moderately related.
> Some are fairly self-evident, like wanting an explicit generic type. In a few 
> cases it looks like import resolution rules tightened up a bit and imports 
> have to be explicit.
> A few more cause problems like:
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala:220:
>  missing parameter type for expanded function
> The argument types of an anonymous function must be fully known. (SLS 8.5)
> Expected type was: ?
> {code}
> In some cases it's just a matter of adding an explicit type, like {{.map { m: 
> Matrix =>}}.
> Many seem to concern functions of tuples, or tuples of tuples.
> {{.mapGroups { case (g, iter) =>}} needs to be simply {{.mapGroups { (g, 
> iter) =>}}
> Or more annoyingly:
> {code}
> }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
>   (wc1 + wc2, df1 + df2)
> }
> {code}
> Apparently the argument types can only be fully known without nesting tuples. This _won't_ work:
> {code}
> }.reduceByKey { case ((wc1: Long, df1: Int), (wc2: Long, df2: Int)) =>
>   (wc1 + wc2, df1 + df2)
> }
> {code}
> This does:
> {code}
> }.reduceByKey { (wcdf1, wcdf2) =>
>   (wcdf1._1 + wcdf2._1, wcdf1._2 + wcdf2._2)
> }
> {code}
> I'm not super clear why most of the problems seem to affect reduceByKey.
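For reference, a self-contained sketch of the same positional-tuple pattern; plain Scala collections stand in for RDDs, and the data and object name below are made up for illustration:
{code:scala}
// Self-contained sketch of the 2.13-friendly pattern above: avoid destructuring
// nested tuples in the function literal and address the fields positionally.
// Plain collections stand in for RDDs; the data is made up.
object ReduceByKeySketch {
  def main(args: Array[String]): Unit = {
    // (word, (wordCount, docFreq)) pairs, as in the CountVectorizer example above.
    val wordStats: Seq[(String, (Long, Int))] =
      Seq(("spark", (3L, 1)), ("spark", (2L, 1)), ("sql", (1L, 1)))

    val merged: Map[String, (Long, Int)] = wordStats
      .groupBy(_._1)
      .map { case (word, entries) =>
        val combined = entries.map(_._2).reduce { (wcdf1, wcdf2) =>
          (wcdf1._1 + wcdf2._1, wcdf1._2 + wcdf2._2)
        }
        word -> combined
      }

    println(merged) // e.g. Map(spark -> (5,2), sql -> (1,1))
  }
}
{code}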



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30226) Remove withXXX functions in WriteBuilder

2019-12-11 Thread Ximo Guanter (Jira)
Ximo Guanter created SPARK-30226:


 Summary: Remove withXXX functions in WriteBuilder
 Key: SPARK-30226
 URL: https://issues.apache.org/jira/browse/SPARK-30226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ximo Guanter


We can add a LogicalWrite interface that provides compile-time guarantees that 
the Spark code is correct, as opposed to the current withXXX functions in 
WriteBuilder, which might not be called correctly by other Spark code (e.g. they 
might not be called at all, or might be called multiple times).
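As a rough, hedged illustration of the difference only: the trait and case class names below are made up, not the actual interfaces proposed in this ticket.
{code:scala}
// Rough illustration only; these are made-up stand-ins, not the real
// WriteBuilder or the proposed interface.
import org.apache.spark.sql.types.StructType

// Today's shape: withXXX methods that Spark may forget to call, or may call
// more than once.
trait WithXxxStyleBuilder {
  def withQueryId(queryId: String): WithXxxStyleBuilder
  def withInputDataSchema(schema: StructType): WithXxxStyleBuilder
  def buildForBatch(): Unit
}

// The idea in this ticket: hand the required information over in one value, so
// a write cannot be built without it and it cannot be supplied twice.
final case class LogicalWriteInfoSketch(queryId: String, schema: StructType)

trait InfoStyleBuilder {
  def build(info: LogicalWriteInfoSketch): Unit
}
{code}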



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30038) DESCRIBE FUNCTION should look up catalog/table like v2 commands

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30038:
-

Assignee: Pablo Langa Blanco

> DESCRIBE FUNCTION should look up catalog/table like v2 commands
> ---
>
> Key: SPARK-30038
> URL: https://issues.apache.org/jira/browse/SPARK-30038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Assignee: Pablo Langa Blanco
>Priority: Major
>
> DESCRIBE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30038) DESCRIBE FUNCTION should look up catalog/table like v2 commands

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30038.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26840
[https://github.com/apache/spark/pull/26840]

> DESCRIBE FUNCTION should look up catalog/table like v2 commands
> ---
>
> Key: SPARK-30038
> URL: https://issues.apache.org/jira/browse/SPARK-30038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Assignee: Pablo Langa Blanco
>Priority: Major
> Fix For: 3.0.0
>
>
> DESCRIBE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30198.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26828
[https://github.com/apache/spark/pull/26828]

> BytesToBytesMap does not grow internal long array as expected
> -
>
> Key: SPARK-30198
> URL: https://issues.apache.org/jira/browse/SPARK-30198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. 
> After inspection, the long array size is 536870912.
> Currently in BytesToBytesMap.append, we only grow the internal array if the 
> size of the array is less than its MAX_CAPACITY, which is 536870912. So in the 
> above case, the array cannot grow, and safeLookup can never find an empty slot.
> But this is wrong because we use two array entries per key, so the array size 
> is twice the capacity. We should compare the current capacity of the array, 
> instead of its size.
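For illustration, a small standalone sketch of the check described above; it is not the actual BytesToBytesMap code, and the only pieces taken from the report are the MAX_CAPACITY value and the two-entries-per-key layout.
{code:scala}
// Standalone sketch, not the real BytesToBytesMap: the map stores two longArray
// entries per key, so its capacity is longArray.size / 2. Gating growth on the
// raw size therefore stops growth at half of the intended maximum.
object GrowCheckSketch {
  val MAX_CAPACITY: Long = 536870912L // value quoted in the description

  // Condition described as buggy: compares the array size against MAX_CAPACITY.
  def canGrowBySize(longArraySize: Long): Boolean =
    longArraySize < MAX_CAPACITY

  // Proposed condition: compares the capacity (size / 2, two entries per key).
  def canGrowByCapacity(longArraySize: Long): Boolean =
    (longArraySize / 2) < MAX_CAPACITY

  def main(args: Array[String]): Unit = {
    val size = MAX_CAPACITY // array already at MAX_CAPACITY entries
    println(canGrowBySize(size))     // false -> map stops growing, lookups can hang
    println(canGrowByCapacity(size)) // true  -> map may still grow
  }
}
{code}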



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30198:
--
Affects Version/s: 2.3.4
   2.4.4

> BytesToBytesMap does not grow internal long array as expected
> -
>
> Key: SPARK-30198
> URL: https://issues.apache.org/jira/browse/SPARK-30198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. 
> After inspection, the long array size is 536870912.
> Currently in BytesToBytesMap.append, we only grow the internal array if the 
> size of the array is less than its MAX_CAPACITY, which is 536870912. So in the 
> above case, the array cannot grow, and safeLookup can never find an empty slot.
> But this is wrong because we use two array entries per key, so the array size 
> is twice the capacity. We should compare the current capacity of the array, 
> instead of its size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30198:
--
Affects Version/s: 2.2.3

> BytesToBytesMap does not grow internal long array as expected
> -
>
> Key: SPARK-30198
> URL: https://issues.apache.org/jira/browse/SPARK-30198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. 
> After inspection, the long array size is 536870912.
> Currently in BytesToBytesMap.append, we only grow the internal array if the 
> size of the array is less than its MAX_CAPACITY, which is 536870912. So in the 
> above case, the array cannot grow, and safeLookup can never find an empty slot.
> But this is wrong because we use two array entries per key, so the array size 
> is twice the capacity. We should compare the current capacity of the array, 
> instead of its size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30198:
--
Affects Version/s: 2.0.2
   2.1.3

> BytesToBytesMap does not grow internal long array as expected
> -
>
> Key: SPARK-30198
> URL: https://issues.apache.org/jira/browse/SPARK-30198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. 
> After inspection, the long array size is 536870912.
> Currently in BytesToBytesMap.append, we only grow the internal array if the 
> size of the array is less than its MAX_CAPACITY, which is 536870912. So in the 
> above case, the array cannot grow, and safeLookup can never find an empty slot.
> But this is wrong because we use two array entries per key, so the array size 
> is twice the capacity. We should compare the current capacity of the array, 
> instead of its size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30198:
--
Affects Version/s: 1.6.3

> BytesToBytesMap does not grow internal long array as expected
> -
>
> Key: SPARK-30198
> URL: https://issues.apache.org/jira/browse/SPARK-30198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. 
> After inspection, the long array size is 536870912.
> Currently in BytesToBytesMap.append, we only grow the internal array if the 
> size of the array is less than its MAX_CAPACITY, which is 536870912. So in the 
> above case, the array cannot grow, and safeLookup can never find an empty slot.
> But this is wrong because we use two array entries per key, so the array size 
> is twice the capacity. We should compare the current capacity of the array, 
> instead of its size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2019-12-11 Thread Nasir Ali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993981#comment-16993981
 ] 

Nasir Ali commented on SPARK-22947:
---

[~icexelloss] Is there any update on this issue? Or is there any alternate in 
Spark to perform such task now?

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has 
> a begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (i.e., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * serves as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note that row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1-day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
> {code}
> h2. Optional Design Sketch
> h3.

[jira] [Created] (SPARK-30227) Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-30227:


 Summary: Add close() on DataWriter interface
 Key: SPARK-30227
 URL: https://issues.apache.org/jira/browse/SPARK-30227
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter instance 
ends at either commit() or abort(). That makes datasource implementors feel 
they can place resource cleanup on both sides, but abort() can be called when 
commit() fails, so they have to ensure they don't do double cleanup if the 
cleanup is not idempotent.

So I'm proposing to add close() on DataWriter explicitly, which is "the place" 
for resource cleanup. The lifecycle of a DataWriter instance will (and should) 
end at close().

I've checked some callers to see whether they can apply "try-catch-finally" to 
ensure close() is called at the end of the DataWriter lifecycle, and it looks 
like they can.

The change would be backward incompatible, but given that the interface is 
marked as Evolving and we're making backward incompatible changes in Spark 3.0, 
I feel it may not matter.

I've raised a discussion around this issue and the feedback is positive: 
https://lists.apache.org/thread.html/bfdb989fa83bc4d774804473610bd0cfcaa1dd5a020ca9a522f3510c%40%3Cdev.spark.apache.org%3E
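A minimal sketch of the intended caller-side shape, assuming close() is added; the trait below is a made-up stand-in, not the real DataWriter interface.
{code:scala}
// Sketch only: WriterSketch stands in for DataWriter with the proposed close().
object DataWriterLifecycleSketch {
  trait WriterSketch {
    def write(record: String): Unit
    def commit(): Unit
    def abort(): Unit
    def close(): Unit // proposed: the single place for resource cleanup
  }

  def runTask(writer: WriterSketch, records: Iterator[String]): Unit = {
    try {
      try {
        records.foreach(writer.write)
        writer.commit()
      } catch {
        case t: Throwable =>
          writer.abort() // abort() may run after a failed commit(); no cleanup here
          throw t
      }
    } finally {
      writer.close() // cleanup happens exactly once, whether we committed or aborted
    }
  }
}
{code}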



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30227) Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994022#comment-16994022
 ] 

Jungtaek Lim commented on SPARK-30227:
--

Working on it. I'll raise a PR soon.

> Add close() on DataWriter interface
> ---
>
> Key: SPARK-30227
> URL: https://issues.apache.org/jira/browse/SPARK-30227
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter 
> instance ends at either commit() or abort(). That makes datasource 
> implementors feel they can place resource cleanup on both sides, but 
> abort() can be called when commit() fails, so they have to ensure they don't 
> do double cleanup if the cleanup is not idempotent.
> So I'm proposing to add close() on DataWriter explicitly, which is "the 
> place" for resource cleanup. The lifecycle of a DataWriter instance will (and 
> should) end at close().
> I've checked some callers to see whether they can apply "try-catch-finally" 
> to ensure close() is called at the end of the DataWriter lifecycle, and it 
> looks like they can.
> The change would be backward incompatible, but given that the interface 
> is marked as Evolving and we're making backward incompatible changes in Spark 
> 3.0, I feel it may not matter.
> I've raised a discussion around this issue and the feedback is positive: 
> https://lists.apache.org/thread.html/bfdb989fa83bc4d774804473610bd0cfcaa1dd5a020ca9a522f3510c%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30223) queries in thrift server may read wrong SQL configs

2019-12-11 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994061#comment-16994061
 ] 

angerszhu commented on SPARK-30223:
---

[~cloud_fan]
Add 
```
SparkSession.setActiveSession(sqlContext.sparkSession) 
```
in each Metadata Operation?

> queries in thrift server may read wrong SQL configs
> ---
>
> Key: SPARK-30223
> URL: https://issues.apache.org/jira/browse/SPARK-30223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The Spark thrift server creates many SparkSessions to serve requests, and the 
> thrift server serves requests using a single thread. One thread can only have 
> one active SparkSession, so SQLConf.get can't get the proper conf from the 
> session that runs the query.
> Whenever we issue an action on a SparkSession, we should set this session as the 
> active session, e.g. `SparkSession.sql`.
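For illustration, a minimal sketch of that idea; the helper object below is made up and is not thrift-server code.
{code:scala}
// Made-up helper illustrating the suggested fix: pin the request's session as
// the active one before running anything that may consult SQLConf.get.
import org.apache.spark.sql.{DataFrame, SparkSession}

object ActiveSessionSketch {
  def runQuery(session: SparkSession, statement: String): DataFrame = {
    // Ensure SQLConf.get resolves to this session's conf, not whichever session
    // was last active on the shared serving thread.
    SparkSession.setActiveSession(session)
    session.sql(statement)
  }
}
{code}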



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30223) queries in thrift server may read wrong SQL configs

2019-12-11 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994061#comment-16994061
 ] 

angerszhu edited comment on SPARK-30223 at 12/12/19 1:49 AM:
-

[~cloud_fan]
Add 
```
SparkSession.setActiveSession(sqlContext.sparkSession) 
```
in each Metadata Operation?

Maybe we should sort out the use of `SQLConf.get()` like 
https://github.com/apache/spark/pull/26187


was (Author: angerszhuuu):
[~cloud_fan]
Add 
```
SparkSession.setActiveSession(sqlContext.sparkSession) 
```
in each Metadata Operation?

> queries in thrift server may read wrong SQL configs
> ---
>
> Key: SPARK-30223
> URL: https://issues.apache.org/jira/browse/SPARK-30223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The Spark thrift server creates many SparkSessions to serve requests, and the 
> thrift server serves requests using a single thread. One thread can only have 
> one active SparkSession, so SQLConf.get can't get the proper conf from the 
> session that runs the query.
> Whenever we issue an action on a SparkSession, we should set this session as 
> the active session, e.g. `SparkSession.sql`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30223) queries in thrift server may read wrong SQL configs

2019-12-11 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994061#comment-16994061
 ] 

angerszhu edited comment on SPARK-30223 at 12/12/19 1:49 AM:
-

[~cloud_fan] [~yumwang]
Add 
```
SparkSession.setActiveSession(sqlContext.sparkSession) 
```
in each Metadata Operation?

Maybe we should sort out the use of `SQLConf.get()` like 
https://github.com/apache/spark/pull/26187


was (Author: angerszhuuu):
[~cloud_fan]
Add 
```
SparkSession.setActiveSession(sqlContext.sparkSession) 
```
in each Metadata Operation?

Maybe we should sort out the use of `SQLConf.get()` like 
https://github.com/apache/spark/pull/26187

> queries in thrift server may read wrong SQL configs
> ---
>
> Key: SPARK-30223
> URL: https://issues.apache.org/jira/browse/SPARK-30223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The Spark thrift server creates many SparkSessions to serve requests, and the 
> thrift server serves requests using a single thread. One thread can only have 
> one active SparkSession, so SQLConf.get can't get the proper conf from the 
> session that runs the query.
> Whenever we issue an action on a SparkSession, we should set this session as 
> the active session, e.g. `SparkSession.sql`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30228) Update zstd-jni to 1.4.4-3

2019-12-11 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30228:
-

 Summary: Update zstd-jni to 1.4.4-3
 Key: SPARK-30228
 URL: https://issues.apache.org/jira/browse/SPARK-30228
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30222) Still getting KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKe

2019-12-11 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-30222.
--
   Fix Version/s: (was: 2.4.1)
Target Version/s:   (was: 2.4.1)
  Resolution: Invalid

> Still getting KafkaConsumer cache hitting max capacity of 64, removing 
> consumer for CacheKe
> ---
>
> Key: SPARK-30222
> URL: https://issues.apache.org/jira/browse/SPARK-30222
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: {{Below are the logs.}}
> 2019-12-11 08:33:31,504 [Executor task launch worker for task 1050] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-21)
> 2019-12-11 08:33:32,493 [Executor task launch worker for task 1051] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-5)
> 2019-12-11 08:33:32,570 [Executor task launch worker for task 1052] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-9)
> 2019-12-11 08:33:33,441 [Executor task launch worker for task 1053] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-17)
> 2019-12-11 08:33:33,619 [Executor task launch worker for task 1054] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-2)
> 2019-12-11 08:33:34,474 [Executor task launch worker for task 1055] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-10)
> 2019-12-11 08:33:35,006 [Executor task launch worker for task 1056] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-6)
> 2019-12-11 08:33:36,326 [Executor task launch worker for task 1057] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-14)
> 2019-12-11 08:33:36,634 [Executor task launch worker for task 1058] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-0)
> 2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-19)
> 2019-12-11 08:33:39,183 [stream execution thread for [id = 
> b3aec196-e4f2-4ef9-973b-d5685eba917e, runId = 
> 5c35a63a-16ad-4899-b732-1019397770bd]] WARN 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor - Current 
> batch is falling behind. The trigger interval is 15000 milliseconds, but 
> spent 63438 milliseconds
>Reporter: Shyam
>Priority: Major
>
> I am using spark-sql 2.4.1 with Kafka 0.10.
> While I try to consume data with a consumer, it gives the error below even after 
> setting 
> .option("spark.sql.kafkaConsumerCache.capacity",128)
>  
> {{Dataset df = sparkSession}}
> {{       .readStream()}}
> {{       .format("kafka")}}
> {{       .option("kafka.bootstrap.servers", SERVERS)}}
> {{       .option("subscribe", TOPIC) }}{{}}
> {{       .option("spark.sql.kafkaConsumerCache.capacity",128)   }}
> {{ }}
> {{       .load();}}
> {{}}
> {{}}
> {{}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---

[jira] [Commented] (SPARK-30222) Still getting KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKe

2019-12-11 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994085#comment-16994085
 ] 

Jungtaek Lim commented on SPARK-30222:
--

That config is renamed from "spark.kafka.consumer.cache.capacity" to 
"spark.sql.kafkaConsumerCache.capacity" in Spark 3.0. Please use 
"spark.kafka.consumer.cache.capacity" instead.

I'm closing this. Please reopen this if the problem persists after changing 
your configuration key.

> Still getting KafkaConsumer cache hitting max capacity of 64, removing 
> consumer for CacheKe
> ---
>
> Key: SPARK-30222
> URL: https://issues.apache.org/jira/browse/SPARK-30222
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: {{Below are the logs.}}
> 2019-12-11 08:33:31,504 [Executor task launch worker for task 1050] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-21)
> 2019-12-11 08:33:32,493 [Executor task launch worker for task 1051] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-5)
> 2019-12-11 08:33:32,570 [Executor task launch worker for task 1052] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-9)
> 2019-12-11 08:33:33,441 [Executor task launch worker for task 1053] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-17)
> 2019-12-11 08:33:33,619 [Executor task launch worker for task 1054] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-2)
> 2019-12-11 08:33:34,474 [Executor task launch worker for task 1055] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-10)
> 2019-12-11 08:33:35,006 [Executor task launch worker for task 1056] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-6)
> 2019-12-11 08:33:36,326 [Executor task launch worker for task 1057] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-14)
> 2019-12-11 08:33:36,634 [Executor task launch worker for task 1058] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-0)
> 2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-19)
> 2019-12-11 08:33:39,183 [stream execution thread for [id = 
> b3aec196-e4f2-4ef9-973b-d5685eba917e, runId = 
> 5c35a63a-16ad-4899-b732-1019397770bd]] WARN 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor - Current 
> batch is falling behind. The trigger interval is 15000 milliseconds, but 
> spent 63438 milliseconds
>Reporter: Shyam
>Priority: Major
> Fix For: 2.4.1
>
>
> I am using spark-sql 2.4.1 with Kafka 0.10.
> While I try to consume data with a consumer, it gives the error below even after 
> setting 
> .option("spark.sql.kafkaConsumerCache.capacity",128)
>  
> {{Dataset df = sparkSession}}
> {{       .readStream()}}
> {{       .format("kafka")}}
> {{       .option("kafka.bootstrap.servers", S

[jira] [Resolved] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30199.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26827
[https://github.com/apache/spark/pull/26827]

> Recover spark.ui.port and spark.blockManager.port from checkpoint
> -
>
> Key: SPARK-30199
> URL: https://issues.apache.org/jira/browse/SPARK-30199
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30222) Still getting KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKe

2019-12-11 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994093#comment-16994093
 ] 

Jungtaek Lim commented on SPARK-30222:
--

Sorry, I confused two configurations. "spark.sql.kafkaConsumerCache.capacity" is 
for Spark versions below 3.0, but you have to set it in the Spark configuration 
instead of as a source option. That should be the reason.
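For illustration, a sketch of that distinction; the broker address and topic name are placeholders, and only the configuration key comes from this ticket.
{code:scala}
// Sketch: the consumer-cache capacity is a Spark configuration (set on the
// session, via --conf, or in spark-defaults.conf), not a Kafka source option.
// Broker address and topic below are placeholders.
import org.apache.spark.sql.SparkSession

object KafkaCacheCapacitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-cache-capacity")
      .config("spark.sql.kafkaConsumerCache.capacity", "128") // Spark conf (pre-3.0 name)
      .getOrCreate()

    // The reader then carries only Kafka source options.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")
      .load()

    df.printSchema()
  }
}
{code}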

> Still getting KafkaConsumer cache hitting max capacity of 64, removing 
> consumer for CacheKe
> ---
>
> Key: SPARK-30222
> URL: https://issues.apache.org/jira/browse/SPARK-30222
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: {{Below are the logs.}}
> 2019-12-11 08:33:31,504 [Executor task launch worker for task 1050] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-21)
> 2019-12-11 08:33:32,493 [Executor task launch worker for task 1051] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-5)
> 2019-12-11 08:33:32,570 [Executor task launch worker for task 1052] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-9)
> 2019-12-11 08:33:33,441 [Executor task launch worker for task 1053] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-17)
> 2019-12-11 08:33:33,619 [Executor task launch worker for task 1054] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-2)
> 2019-12-11 08:33:34,474 [Executor task launch worker for task 1055] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-10)
> 2019-12-11 08:33:35,006 [Executor task launch worker for task 1056] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-6)
> 2019-12-11 08:33:36,326 [Executor task launch worker for task 1057] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-14)
> 2019-12-11 08:33:36,634 [Executor task launch worker for task 1058] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-0)
> 2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-19)
> 2019-12-11 08:33:39,183 [stream execution thread for [id = 
> b3aec196-e4f2-4ef9-973b-d5685eba917e, runId = 
> 5c35a63a-16ad-4899-b732-1019397770bd]] WARN 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor - Current 
> batch is falling behind. The trigger interval is 15000 milliseconds, but 
> spent 63438 milliseconds
>Reporter: Shyam
>Priority: Major
>
> I am using spark-sql 2.4.1 with Kafka 0.10.
> While I try to consume data with a consumer, it gives the error below even after 
> setting 
> .option("spark.sql.kafkaConsumerCache.capacity",128)
>  
> {{Dataset df = sparkSession}}
> {{       .readStream()}}
> {{       .format("kafka")}}
> {{       .option("kafka.bootstrap.servers", SERVERS)}}
> {{       .option("subscribe", TOPIC) }}{{}}
> {{       .option("spark.sql.kafkaConsumerCache.ca

[jira] [Comment Edited] (SPARK-30222) Still getting KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKe

2019-12-11 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994085#comment-16994085
 ] 

Jungtaek Lim edited comment on SPARK-30222 at 12/12/19 2:28 AM:


-That config is renamed from "spark.kafka.consumer.cache.capacity" to 
"spark.sql.kafkaConsumerCache.capacity" in Spark 3.0. Please use 
"spark.kafka.consumer.cache.capacity" instead.-

-I'm closing this. Please reopen this if the problem persists after changing 
your configuration key.-

EDIT: "spark.sql.kafkaConsumerCache.capacity" is for Spark less than 3.0 so the 
configuration name is correct. But it should be specified in Spark config.


was (Author: kabhwan):
-That config is renamed from "spark.kafka.consumer.cache.capacity" to 
"spark.sql.kafkaConsumerCache.capacity" in Spark 3.0. Please use 
"spark.kafka.consumer.cache.capacity" instead.-

I'm closing this. Please reopen this if the problem persists after changing 
your configuration key.

EDIT: "spark.sql.kafkaConsumerCache.capacity" is for Spark less than 3.0 so the 
configuration name is correct.

> Still getting KafkaConsumer cache hitting max capacity of 64, removing 
> consumer for CacheKe
> ---
>
> Key: SPARK-30222
> URL: https://issues.apache.org/jira/browse/SPARK-30222
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: {{Below are the logs.}}
> 2019-12-11 08:33:31,504 [Executor task launch worker for task 1050] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-21)
> 2019-12-11 08:33:32,493 [Executor task launch worker for task 1051] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-5)
> 2019-12-11 08:33:32,570 [Executor task launch worker for task 1052] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-9)
> 2019-12-11 08:33:33,441 [Executor task launch worker for task 1053] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-17)
> 2019-12-11 08:33:33,619 [Executor task launch worker for task 1054] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-2)
> 2019-12-11 08:33:34,474 [Executor task launch worker for task 1055] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-10)
> 2019-12-11 08:33:35,006 [Executor task launch worker for task 1056] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-6)
> 2019-12-11 08:33:36,326 [Executor task launch worker for task 1057] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-14)
> 2019-12-11 08:33:36,634 [Executor task launch worker for task 1058] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-0)
> 2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-19)
> 2019-12-11 08:33:39,183 [stream execution thread for [id = 
> b3aec196-e4f2-4ef9-973b-d5685eba917e, runId = 
> 5c35a63a-16ad-4

[jira] [Comment Edited] (SPARK-30222) Still getting KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKe

2019-12-11 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994085#comment-16994085
 ] 

Jungtaek Lim edited comment on SPARK-30222 at 12/12/19 2:28 AM:


-That config is renamed from "spark.kafka.consumer.cache.capacity" to 
"spark.sql.kafkaConsumerCache.capacity" in Spark 3.0. Please use 
"spark.kafka.consumer.cache.capacity" instead.-

I'm closing this. Please reopen this if the problem persists after changing 
your configuration key.

EDIT: "spark.sql.kafkaConsumerCache.capacity" is for Spark less than 3.0 so the 
configuration name is correct.


was (Author: kabhwan):
That config is renamed from "spark.kafka.consumer.cache.capacity" to 
"spark.sql.kafkaConsumerCache.capacity" in Spark 3.0. Please use 
"spark.kafka.consumer.cache.capacity" instead.

I'm closing this. Please reopen this if the problem persists after changing 
your configuration key.

> Still getting KafkaConsumer cache hitting max capacity of 64, removing 
> consumer for CacheKe
> ---
>
> Key: SPARK-30222
> URL: https://issues.apache.org/jira/browse/SPARK-30222
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: {{Below are the logs.}}
> 2019-12-11 08:33:31,504 [Executor task launch worker for task 1050] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-21)
> 2019-12-11 08:33:32,493 [Executor task launch worker for task 1051] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-5)
> 2019-12-11 08:33:32,570 [Executor task launch worker for task 1052] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-9)
> 2019-12-11 08:33:33,441 [Executor task launch worker for task 1053] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-17)
> 2019-12-11 08:33:33,619 [Executor task launch worker for task 1054] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-2)
> 2019-12-11 08:33:34,474 [Executor task launch worker for task 1055] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-10)
> 2019-12-11 08:33:35,006 [Executor task launch worker for task 1056] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-6)
> 2019-12-11 08:33:36,326 [Executor task launch worker for task 1057] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-14)
> 2019-12-11 08:33:36,634 [Executor task launch worker for task 1058] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-0)
> 2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN 
> org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting 
> max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,COMPANY_TRANSACTIONS_INBOUND-19)
> 2019-12-11 08:33:39,183 [stream execution thread for [id = 
> b3aec196-e4f2-4ef9-973b-d5685eba917e, runId = 
> 5c35a63a-16ad-4899-b732-1019397770bd]] WARN 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor - Current 
> batch is falling behind. The trigger interval is 150

[jira] [Commented] (SPARK-30224) Configurable field and record separators for filters

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994100#comment-16994100
 ] 

Hyukjin Kwon commented on SPARK-30224:
--

Can we make it support quote and/or escaping?
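
For context on the parsing in question, the filter parameters are packed into a single conf value that is split on ',' and '=', so a value containing those characters cannot be expressed today. A hedged illustration (the filter class name and LDAP-style parameter are made up):

{code:python}
from pyspark import SparkConf

# Illustrative only: the filter class name and parameter below are hypothetical.
# The LDAP-style value contains both ',' and '=', which the current
# key=value,key=value parsing of the ".params" entry cannot represent.
conf = (SparkConf()
        .set("spark.ui.filters", "org.example.AuthFilter")
        .set("spark.org.example.AuthFilter.params",
             "binddn=cn=admin,dc=example,dc=org,timeout=30"))
{code}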

> Configurable field and record separators for filters
> 
>
> Key: SPARK-30224
> URL: https://issues.apache.org/jira/browse/SPARK-30224
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: thierry accart
>Priority: Major
>
> The method addfilter of JettyUtils parses sparkConf using fixed field and 
> record separators: at the moment it is not possible to have '=' or ',' signs in 
> a parameter value. This makes it impossible to configure, for example, Hadoop's 
> LDAP authenticator filter, as some of its parameters contain comma and equal 
> signs.
> To fix this problem, there are two solutions:
> either we allow spark..params.= in sparkConf
> or 
> we allow spark..params.rs= and 
> spark..params.fs= 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30220) Support Filter expression uses IN/EXISTS predicate sub-queries

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994101#comment-16994101
 ] 

Hyukjin Kwon commented on SPARK-30220:
--

[~beliefer], can you check the comments of its parent JIRA? It would be better to 
check other DBMSes too.

> Support Filter expression uses IN/EXISTS predicate sub-queries
> --
>
> Key: SPARK-30220
> URL: https://issues.apache.org/jira/browse/SPARK-30220
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support a SQL statement whose aggregate FILTER clause uses an IN/EXISTS predicate sub-query, as below:
>  
> {code:java}
> select sum(unique1) FILTER (WHERE
>  unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1;{code}
>  
> And Spark will throw exception as follows:
>  
> {code:java}
> org.apache.spark.sql.AnalysisException
> IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few 
> commands: Aggregate [sum(cast(unique1#x as bigint)) AS sum(unique1)#xL]
> : +- Project [unique1#x]
> : +- Filter (unique1#x < 100)
> : +- SubqueryAlias `onek`
> : +- RelationV2[unique1#x, unique2#x, two#x, four#x, ten#x, twenty#x, 
> hundred#x, thousand#x, twothousand#x, fivethous#x, tenthous#x, odd#x, even#x, 
> stringu1#x, stringu2#x, string4#x] csv 
> file:/home/xitong/code/gengjiaan/spark/sql/core/target/scala-2.12/test-classes/test-data/postgresql/onek.data
> +- SubqueryAlias `tenk1`
>  +- RelationV2[unique1#x, unique2#x, two#x, four#x, ten#x, twenty#x, 
> hundred#x, thousand#x, twothousand#x, fivethous#x, tenthous#x, odd#x, even#x, 
> stringu1#x, stringu2#x, string4#x] csv 
> file:/home/xitong/code/gengjiaan/spark/sql/core/target/scala-2.12/test-classes/test-data/postgresql/tenk.data{code}
>  
> But PostgreSQL supports this syntax.
> {code:java}
> select sum(unique1) FILTER (WHERE
>  unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1;
>  sum 
> --
>  4950
> (1 row){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30219) Support Filter expression reference the outer query

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994102#comment-16994102
 ] 

Hyukjin Kwon commented on SPARK-30219:
--

[~beliefer] check the comments in its parent JIRA.

> Support Filter expression reference the outer query
> ---
>
> Key: SPARK-30219
> URL: https://issues.apache.org/jira/browse/SPARK-30219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support a FILTER expression that references the outer query, as below:
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c);{code}
>  
> And Spark will throw exception as follows:
> {code:java}
> org.apache.spark.sql.AnalysisException
> Expressions referencing the outer query are not supported outside of 
> WHERE/HAVING clauses:
> Aggregate [count(1) AS count(1)#xL]
> +- Project [col1#x AS inner_c#x]
>  +- SubqueryAlias `t0`
>  +- LocalRelation [col1#x]{code}
> But PostgreSQL supports this syntax.
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
>  count 
> ---
>  2
> (1 row){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30208) A race condition when reading from Kafka in PySpark

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994105#comment-16994105
 ] 

Hyukjin Kwon commented on SPARK-30208:
--

Possibly it's a duplicate of SPARK-22340
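
For reference, the failure mode needs a PySpark Kafka read whose task finishes early while the separate Python writer thread is still pulling records; a minimal hedged sketch of that shape (broker and topic are placeholders, and an active SparkSession `spark` is assumed):

{code:python}
# Hedged repro sketch: an early-terminated task (here via limit) lets the task
# thread close the KafkaConsumer while the separate Python writer thread may
# still be reading from it. Broker and topic names are placeholders.
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "some_topic")
      .load())

df.selectExpr("CAST(value AS STRING)").limit(1).collect()
{code}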

> A race condition when reading from Kafka in PySpark
> ---
>
> Key: SPARK-30208
> URL: https://issues.apache.org/jira/browse/SPARK-30208
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Jiawen Zhu
>Priority: Major
>
> When using PySpark to read from Kafka, there is a race condition that Spark 
> may use KafkaConsumer in multiple threads at the same time and throw the 
> following error:
> {code}
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
> at 
> kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:2215)
> at 
> kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2104)
> at 
> kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2059)
> at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.close(KafkaDataConsumer.scala:451)
> at 
> org.apache.spark.sql.kafka010.KafkaDataConsumer$NonCachedKafkaDataConsumer.release(KafkaDataConsumer.scala:508)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.close(KafkaSourceRDD.scala:126)
> at 
> org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:131)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:130)
> at 
> org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:162)
> at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131)
> at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131)
> at 
> org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:144)
> at 
> org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:142)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:142)
> at 
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:130)
> at org.apache.spark.scheduler.Task.doRunTask(Task.scala:155)
> at org.apache.spark.scheduler.Task.run(Task.scala:112)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> When using PySpark, reading from Kafka is actually happening in a separate 
> writer thread rather that the task thread.  When a task is early terminated 
> (e.g., there is a limit operator), the task thread may stop the KafkaConsumer 
> when the writer thread is using it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30182) Support nested aggregates

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994106#comment-16994106
 ] 

Hyukjin Kwon commented on SPARK-30182:
--

[~beliefer] please check the comments in its parent JIRA.

> Support nested aggregates
> -
>
> Key: SPARK-30182
> URL: https://issues.apache.org/jira/browse/SPARK-30182
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support a SQL statement with a nested aggregate, as below:
> {code:java}
> SELECT sum(salary), row_number() OVER (ORDER BY depname), sum(
>  sum(salary) FILTER (WHERE enroll_date > '2007-01-01')
> ) FILTER (WHERE depname <> 'sales') OVER (ORDER BY depname DESC) AS 
> "filtered_sum",
>  depname
> FROM empsalary GROUP BY depname;{code}
> And Spark will throw exception as follows:
> {code:java}
> org.apache.spark.sql.AnalysisException
> It is not allowed to use an aggregate function in the argument of another 
> aggregate function. Please use the inner aggregate function in a 
> sub-query.{code}
> But PostgreSQL supports this syntax.
> {code:java}
> SELECT sum(salary), row_number() OVER (ORDER BY depname), sum(
>  sum(salary) FILTER (WHERE enroll_date > '2007-01-01')
> ) FILTER (WHERE depname <> 'sales') OVER (ORDER BY depname DESC) AS 
> "filtered_sum",
>  depname
> FROM empsalary GROUP BY depname;
>  sum | row_number | filtered_sum | depname 
> ---++--+---
>  25100 | 1 | 22600 | develop
>  7400 | 2 | 3500 | personnel
>  14600 | 3 | | sales
> (3 rows){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30173) Automatically close stale PRs

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994108#comment-16994108
 ] 

Hyukjin Kwon commented on SPARK-30173:
--

[~nchammas] please go ahead!

> Automatically close stale PRs
> -
>
> Key: SPARK-30173
> URL: https://issues.apache.org/jira/browse/SPARK-30173
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> To manage the number of open PRs we have at any one time, we should 
> automatically close stale PRs with a friendly message.
> Background discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Closing-stale-PRs-with-a-GitHub-Action-td28477.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30199:
-

Assignee: Aaruna Godthi

> Recover spark.ui.port and spark.blockManager.port from checkpoint
> -
>
> Key: SPARK-30199
> URL: https://issues.apache.org/jira/browse/SPARK-30199
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Aaruna Godthi
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30219) Support Filter expression reference the outer query

2019-12-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994124#comment-16994124
 ] 

jiaan.geng commented on SPARK-30219:


[~hyukjin.kwon] Thanks! I see.

> Support Filter expression reference the outer query
> ---
>
> Key: SPARK-30219
> URL: https://issues.apache.org/jira/browse/SPARK-30219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support a FILTER expression that references the outer query, as below:
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c);{code}
>  
> And Spark will throw exception as follows:
> {code:java}
> org.apache.spark.sql.AnalysisException
> Expressions referencing the outer query are not supported outside of 
> WHERE/HAVING clauses:
> Aggregate [count(1) AS count(1)#xL]
> +- Project [col1#x AS inner_c#x]
>  +- SubqueryAlias `t0`
>  +- LocalRelation [col1#x]{code}
> But PostgreSQL supports this syntax.
>  
> {code:java}
> select (select count(*) filter (where outer_c <> 0)
>   from (values (1)) t0(inner_c))
> from (values (2),(3)) t1(outer_c); -- outer query is aggregation query
>  count 
> ---
>  2
> (1 row){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30072) Create dedicated planner for subqueries

2019-12-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994136#comment-16994136
 ] 

Wenchen Fan commented on SPARK-30072:
-

Can you be more specific about the problem? AFAIK `AdaptiveSparkPlanExec` is 
only created in the `InsertAdaptiveSparkPlan` rule, and that rule sets the 
`isSubquery` flag correctly.

> Create dedicated planner for subqueries
> ---
>
> Key: SPARK-30072
> URL: https://issues.apache.org/jira/browse/SPARK-30072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> This PR changes subquery planning by calling the planner and plan preparation 
> rules on the subquery plan directly. Before we were creating a QueryExecution 
> instance for subqueries to get the executedPlan. This would re-run analysis 
> and optimization on the subqueries plan. Running the analysis again on an 
> optimized query plan can have unwanted consequences, as some rules, for 
> example DecimalPrecision, are not idempotent.
> As an example, consider the expression 1.7 * avg(a) which after applying the 
> DecimalPrecision rule becomes:
> promote_precision(1.7) * promote_precision(avg(a))
> After the optimization, more specifically the constant folding rule, this 
> expression becomes:
> 1.7 * promote_precision(avg(a))
> Now if we run the analyzer on this optimized query again, we will get:
> promote_precision(1.7) * promote_precision(promote_precision(avg(a)))
> Which will later optimized as:
> 1.7 * promote_precision(promote_precision(avg(a)))
> As can be seen, re-running the analysis and optimization on this expression 
> results in an expression with extra nested promote_precision nodes. Adding 
> unneeded nodes to the plan is problematic because it can eliminate situations 
> where we can reuse the plan.
> We opted to introduce dedicated planners for subqueries, instead of making 
> the DecimalPrecision rule idempotent, because this eliminates this entire 
> category of problems. Another benefit is that planning time for subqueries is 
> reduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30213) Remove the mutable status in QueryStage when enable AQE

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30213:
-

Assignee: Ke Jia

> Remove the mutable status in QueryStage when enable AQE
> ---
>
> Key: SPARK-30213
> URL: https://issues.apache.org/jira/browse/SPARK-30213
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> Currently ShuffleQueryStageExec contains mutable state, e.g. the 
> mapOutputStatisticsFuture variable, so it is not easy to pass along when we copy 
> ShuffleQueryStageExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30213) Remove the mutable status in QueryStage when enable AQE

2019-12-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30213.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26846
[https://github.com/apache/spark/pull/26846]

> Remove the mutable status in QueryStage when enable AQE
> ---
>
> Key: SPARK-30213
> URL: https://issues.apache.org/jira/browse/SPARK-30213
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently ShuffleQueryStageExec contains mutable state, e.g. the 
> mapOutputStatisticsFuture variable, so it is not easy to pass along when we copy 
> ShuffleQueryStageExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2019-12-11 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994166#comment-16994166
 ] 

kevin yu commented on SPARK-19335:
--

[~danny-seismic] [~Vdarshankb] [~nstudenski] [~mrayandutta] [~rinazbelhaj] 
[~drew222]: can you list the reasons why your organization needs this feature? 
We are assessing whether we should resume this work or not.
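
For context, the workaround users typically describe is a staging-table write followed by a server-side merge, which is what a built-in upsert would replace. A hedged sketch assuming a PostgreSQL target, psycopg2 available on the driver, and an existing DataFrame {{df}} (URL, credentials, table and column names are illustrative):

{code:python}
import psycopg2

# Hedged sketch of the usual workaround: write into a staging table with the
# existing JDBC writer, then issue one server-side upsert from the driver.
# Connection details, table and column names are illustrative; a PostgreSQL
# target is assumed and `df` already exists.
jdbc_url = "jdbc:postgresql://db.example.com:5432/app"
props = {"user": "etl", "password": "secret", "driver": "org.postgresql.Driver"}

df.write.jdbc(url=jdbc_url, table="events_staging", mode="overwrite", properties=props)

with psycopg2.connect(host="db.example.com", dbname="app",
                      user="etl", password="secret") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO events (id, payload)
            SELECT id, payload FROM events_staging
            ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload
        """)
{code}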

 

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert, is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/Datasets do not currently support an update feature via the 
> JDBC writer, which allows only Overwrite or Append.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30149) Schema Definition Spark Read

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30149:
-
Priority: Major  (was: Blocker)

> Schema Definition Spark Read
> 
>
> Key: SPARK-30149
> URL: https://issues.apache.org/jira/browse/SPARK-30149
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
> Attachments: Output.txt, Schema.py, Test.csv, Test1.csv
>
>
> Reading a CSV file with a defined schema, I am able to load the files and do the 
> processing, which works fine using the code below. The schema is defined to 
> strictly follow the data types so that precision is recorded accurately.
> {code:java}
> source_schema = StructType([StructField("NAME", StringType(), True), 
> StructField("AGE", StringType(), True), StructField("GENDER", StringType(), True), 
> StructField("PROFESSION", StringType(), True), StructField("SALARY", DecimalType(38, 14), True), 
> StructField("BAD_RECORD", StringType(), True)]){code}
> {code:java}
> df_raw_file = sparksession.read \ 
> .format("csv") \ 
> .option("delimiter", '\t') \ 
> .option("header", "false") \ 
> .option("inferSchema", "true") \ 
> .option("columnNameOfCorruptRecord", "BAD_RECORD") \ 
> .schema(source_schema) \ .load(in_file_list) \ 
> .withColumn("LINE_NUMBER", monotonically_increasing_id()) \ 
> .withColumn("SOURCE_FILE_NAME", input_file_name()){code}
>  As per the Spark Documentation the mode is {{PERMISSIVE}} by default is its 
> not set and A record with less/more tokens than schema is not a corrupted 
> record to CSV. When it meets a record having fewer tokens than the length of 
> the schema, sets {{null}} to extra fields. When the record has more tokens 
> than the length of the schema, it drops extra tokens.
> {code:java}
> FILE SCHEMA TEST.CSVFILE SCHEMA TEST.CSV
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST.CSV
> +--++--+--+--+-+---+-+
> |NAME  |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD  
>  |LINE_NUMBER|SOURCE_FILE_NAME |
> +--++--+--+--+-+---+-+
> |null  |null|null  |null  |null  |NAMEAGE GENDER  
> PROFESSION  SALARY|0  |Test.CSV|
> |JOHN  |27  |MALE  |CEO   |300.1231423450|null
>  |1  |Test.CSV|
> |JUSTIN|67  |MALE  |CTO   |123.2345354345|null
>  |2  |Test.CSV|
> |SARAH |45  |FEMALE|CS|null  |null
>  |3  |Test.CSV|
> |SEAN  |66  |MALE  |CA|null  |SEAN66  MALE
> CA  |4  |Test.CSV|
> |PHIL  |34  |MALE  |null  |234.986986|null
>  |5  |Test.CSV|
> |null  |null|null  |null  |null  |JILL25  
> BOARD|6  |Test.CSV|
> |JACK  |30  |MALE  |BOARD |null  |JACK30  MALE
> BOARD   |7  |Test.CSV|
> +--++--+--+--+-+---+-+{code}
>  
> The TEST1.CSV does not have the SALARY column, so Spark should have set that 
> column to NULL and the BAD_RECORD column should be NULL for those rows; that does 
> not seem to happen and the values are considered CORRUPT. 
> Also, when it meets a corrupted record it should put the malformed string into the field 
> configured by {{columnNameOfCorruptRecord}} and set the other fields to 
> {{null}}; this is also not happening, and I see it happening only for the 
> JILL row.
> {code:java}
> FILE SCHEMA TEST1.CSV 
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST1.CSV
> +--++--+--+--+--+---+-+
>  |NAME |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD 
> |LINE_NUMBER|SOURCE_FILE_NAME | 
> +--++--+---

[jira] [Commented] (SPARK-30149) Schema Definition Spark Read

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994168#comment-16994168
 ] 

Hyukjin Kwon commented on SPARK-30149:
--

Please don't set Critical+, which is usually reserved for committers.

> Schema Definition Spark Read
> 
>
> Key: SPARK-30149
> URL: https://issues.apache.org/jira/browse/SPARK-30149
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
> Attachments: Output.txt, Schema.py, Test.csv, Test1.csv
>
>
> Reading a CSV file with a defined schema, I am able to load the files and do the 
> processing, which works fine using the code below. The schema is defined to 
> strictly follow the data types so that precision is recorded accurately.
> {code:java}
> source_schema = StructType([StructField("NAME", StringType(), True), 
> StructField("AGE", StringType(), True), StructField("GENDER", StringType(), True), 
> StructField("PROFESSION", StringType(), True), StructField("SALARY", DecimalType(38, 14), True), 
> StructField("BAD_RECORD", StringType(), True)]){code}
> {code:java}
> df_raw_file = sparksession.read \ 
> .format("csv") \ 
> .option("delimiter", '\t') \ 
> .option("header", "false") \ 
> .option("inferSchema", "true") \ 
> .option("columnNameOfCorruptRecord", "BAD_RECORD") \ 
> .schema(source_schema) \ .load(in_file_list) \ 
> .withColumn("LINE_NUMBER", monotonically_increasing_id()) \ 
> .withColumn("SOURCE_FILE_NAME", input_file_name()){code}
>  As per the Spark Documentation the mode is {{PERMISSIVE}} by default is its 
> not set and A record with less/more tokens than schema is not a corrupted 
> record to CSV. When it meets a record having fewer tokens than the length of 
> the schema, sets {{null}} to extra fields. When the record has more tokens 
> than the length of the schema, it drops extra tokens.
> {code:java}
> FILE SCHEMA TEST.CSVFILE SCHEMA TEST.CSV
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST.CSV
> +--++--+--+--+-+---+-+
> |NAME  |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD  
>  |LINE_NUMBER|SOURCE_FILE_NAME |
> +--++--+--+--+-+---+-+
> |null  |null|null  |null  |null  |NAMEAGE GENDER  
> PROFESSION  SALARY|0  |Test.CSV|
> |JOHN  |27  |MALE  |CEO   |300.1231423450|null
>  |1  |Test.CSV|
> |JUSTIN|67  |MALE  |CTO   |123.2345354345|null
>  |2  |Test.CSV|
> |SARAH |45  |FEMALE|CS|null  |null
>  |3  |Test.CSV|
> |SEAN  |66  |MALE  |CA|null  |SEAN66  MALE
> CA  |4  |Test.CSV|
> |PHIL  |34  |MALE  |null  |234.986986|null
>  |5  |Test.CSV|
> |null  |null|null  |null  |null  |JILL25  
> BOARD|6  |Test.CSV|
> |JACK  |30  |MALE  |BOARD |null  |JACK30  MALE
> BOARD   |7  |Test.CSV|
> +--++--+--+--+-+---+-+{code}
>  
> The TEST1.CSV does not have the SALARY column, so Spark should have set that 
> column to NULL and the BAD_RECORD column should be NULL for those rows; that does 
> not seem to happen and the values are considered CORRUPT. 
> Also, when it meets a corrupted record it should put the malformed string into the field 
> configured by {{columnNameOfCorruptRecord}} and set the other fields to 
> {{null}}; this is also not happening, and I see it happening only for the 
> JILL row.
> {code:java}
> FILE SCHEMA TEST1.CSV 
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST1.CSV
> +--++--+--+--+--+---+-+
>  |NAME |AGE |GENDER|PRO

[jira] [Commented] (SPARK-30149) Schema Definition Spark Read

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994170#comment-16994170
 ] 

Hyukjin Kwon commented on SPARK-30149:
--

Can you try out in Spark 3 preview?
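
If it helps when retesting, note that {{inferSchema}} is redundant once an explicit schema is passed; a hedged variant of the reader from the report with the mode spelled out (it reuses {{sparksession}}, {{source_schema}} and {{in_file_list}} from the description, and BAD_RECORD must be declared in the schema as a StringType column):

{code:python}
from pyspark.sql.functions import input_file_name, monotonically_increasing_id

# Hedged re-test sketch reusing the names from the report: with an explicit
# schema, "inferSchema" is dropped, and PERMISSIVE mode plus the corrupt-record
# column are spelled out explicitly.
df_raw_file = (sparksession.read
               .format("csv")
               .option("delimiter", "\t")
               .option("header", "false")
               .option("mode", "PERMISSIVE")
               .option("columnNameOfCorruptRecord", "BAD_RECORD")
               .schema(source_schema)
               .load(in_file_list)
               .withColumn("LINE_NUMBER", monotonically_increasing_id())
               .withColumn("SOURCE_FILE_NAME", input_file_name()))
{code}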

> Schema Definition Spark Read
> 
>
> Key: SPARK-30149
> URL: https://issues.apache.org/jira/browse/SPARK-30149
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
> Attachments: Output.txt, Schema.py, Test.csv, Test1.csv
>
>
> Reading a CSV file with a defined schema, I am able to load the files and do the 
> processing, which works fine using the code below. The schema is defined to 
> strictly follow the data types so that precision is recorded accurately.
> {code:java}
> source_schema = StructType([StructField("NAME", StringType(), True), 
> StructField("AGE", StringType(), True), StructField("GENDER", StringType(), True), 
> StructField("PROFESSION", StringType(), True), StructField("SALARY", DecimalType(38, 14), True), 
> StructField("BAD_RECORD", StringType(), True)]){code}
> {code:java}
> df_raw_file = sparksession.read \ 
> .format("csv") \ 
> .option("delimiter", '\t') \ 
> .option("header", "false") \ 
> .option("inferSchema", "true") \ 
> .option("columnNameOfCorruptRecord", "BAD_RECORD") \ 
> .schema(source_schema) \ .load(in_file_list) \ 
> .withColumn("LINE_NUMBER", monotonically_increasing_id()) \ 
> .withColumn("SOURCE_FILE_NAME", input_file_name()){code}
>  As per the Spark Documentation the mode is {{PERMISSIVE}} by default is its 
> not set and A record with less/more tokens than schema is not a corrupted 
> record to CSV. When it meets a record having fewer tokens than the length of 
> the schema, sets {{null}} to extra fields. When the record has more tokens 
> than the length of the schema, it drops extra tokens.
> {code:java}
> FILE SCHEMA TEST.CSVFILE SCHEMA TEST.CSV
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST.CSV
> +--++--+--+--+-+---+-+
> |NAME  |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD  
>  |LINE_NUMBER|SOURCE_FILE_NAME |
> +--++--+--+--+-+---+-+
> |null  |null|null  |null  |null  |NAMEAGE GENDER  
> PROFESSION  SALARY|0  |Test.CSV|
> |JOHN  |27  |MALE  |CEO   |300.1231423450|null
>  |1  |Test.CSV|
> |JUSTIN|67  |MALE  |CTO   |123.2345354345|null
>  |2  |Test.CSV|
> |SARAH |45  |FEMALE|CS|null  |null
>  |3  |Test.CSV|
> |SEAN  |66  |MALE  |CA|null  |SEAN66  MALE
> CA  |4  |Test.CSV|
> |PHIL  |34  |MALE  |null  |234.986986|null
>  |5  |Test.CSV|
> |null  |null|null  |null  |null  |JILL25  
> BOARD|6  |Test.CSV|
> |JACK  |30  |MALE  |BOARD |null  |JACK30  MALE
> BOARD   |7  |Test.CSV|
> +--++--+--+--+-+---+-+{code}
>  
> The TEST1.CSV does not have the SALARY column, so Spark should have set that 
> column to NULL and the BAD_RECORD column should be NULL for those rows; that does 
> not seem to happen and the values are considered CORRUPT. 
> Also, when it meets a corrupted record it should put the malformed string into the field 
> configured by {{columnNameOfCorruptRecord}} and set the other fields to 
> {{null}}; this is also not happening, and I see it happening only for the 
> JILL row.
> {code:java}
> FILE SCHEMA TEST1.CSV 
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST1.CSV
> +--++--+--+--+--+---+-+
>  |NAME |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD 
> |LIN

[jira] [Commented] (SPARK-30139) get_json_object does not work correctly

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994171#comment-16994171
 ] 

Hyukjin Kwon commented on SPARK-30139:
--

[~rakson] Have you made some progress on it?
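
In the meantime, a hedged workaround sketch that avoids the unsupported {{[?(...)]}} path predicate by parsing the array and filtering it with a higher-order function (the schema string and expected output follow the example in the description; an active SparkSession `spark` is assumed):

{code:python}
from pyspark.sql import functions as F

# Hedged workaround sketch: parse the JSON array into structs and filter it,
# instead of relying on a [?(...)] predicate inside the JSON path.
df = spark.createDataFrame(
    [('[{"id":123,"value":"a"},{"id":456,"value":"b"}]',)], ["js"])

df.select(
    F.expr("filter(from_json(js, 'array<struct<id:int,value:string>>'), "
           "x -> x.id = 123)[0].value").alias("value")
).show()  # expected to print: a
{code}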

> get_json_object does not work correctly
> ---
>
> Key: SPARK-30139
> URL: https://issues.apache.org/jira/browse/SPARK-30139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Clemens Valiente
>Priority: Major
>
> according to documentation:
> [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-]
> get_json_object "Extracts json object from a json string based on json path 
> specified, and returns json string of the extracted json object. It will 
> return null if the input json string is invalid."
>  
> the following SQL snippet returns null even though it should return 'a'
> {code}
> select get_json_object('[{"id":123,"value":"a"},{"id":456,"value":"b"}]', 
> '$[?($.id==123)].value'){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30140) Code comment error

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30140.
--
Resolution: Invalid

>  Code comment error
> ---
>
> Key: SPARK-30140
> URL: https://issues.apache.org/jira/browse/SPARK-30140
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: wuv1up
>Priority: Trivial
>
> ignore...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30137) Support DELETE file

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994172#comment-16994172
 ] 

Hyukjin Kwon commented on SPARK-30137:
--

[~sandeep.katta2007] Have you made some progress on this?

> Support DELETE file 
> 
>
> Key: SPARK-30137
> URL: https://issues.apache.org/jira/browse/SPARK-30137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994174#comment-16994174
 ] 

Hyukjin Kwon commented on SPARK-30130:
--

[~mboegner], can you also clarify which DBMSes you were referring to with "this error 
does not appear in a traditional subselect format"?
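
As a side note, one hedged way to check whether the ordinal interpretation is the culprit is to disable it and re-run the CTE (the config exists in 2.4; whether the rewritten query then passes analysis is not guaranteed):

{code:python}
# Hedged check: with GROUP BY ordinal resolution disabled, the literal 0 that
# the alias "test" resolves to is no longer read as a column position. This is
# only a diagnostic, not a claim about the intended long-term behaviour.
spark.conf.set("spark.sql.groupByOrdinal", "false")

spark.sql("""
  with a as (select 0 as test, count(*) as cnt group by test)
  select * from a
""").show()
{code}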

> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions
> 
>
> Key: SPARK-30130
> URL: https://issues.apache.org/jira/browse/SPARK-30130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Matt Boegner
>Priority: Minor
>
> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions.
> {code:java}
> val df = spark.sql("""
>  with a as (select 0 as test, count(*) group by test)
>  select * from a
>  """)
>  df.show(){code}
> This results in an error message like {color:#e01e5a}GROUP BY position 0 is 
> not in select list (valid range is [1, 2]){color} .
>  
> However, this error does not appear in a traditional subselect format. For 
> example, this query executes correctly:
> {code:java}
> val df = spark.sql("""
>  select * from (select 0 as test, count(*) group by test) a
>  """)
>  df.show(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30123) PartitionPruning should consider more case

2019-12-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994175#comment-16994175
 ] 

Hyukjin Kwon commented on SPARK-30123:
--

[~deshanxiao] can you show the plan generated from the actual code? Also 
please fix the JIRA title to be more specific.

> PartitionPruning should consider more case
> --
>
> Key: SPARK-30123
> URL: https://issues.apache.org/jira/browse/SPARK-30123
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: deshanxiao
>Priority: Major
>
> If the left side has a partition table scan and the right side has a partition-pruning 
> filter but hasBenefit is false, the right side will never get a subquery inserted.
> {code:java}
> var partScan = getPartitionTableScan(l, left)
> if (partScan.isDefined && canPruneLeft(joinType) &&
> hasPartitionPruningFilter(right)) {
>   val hasBenefit = pruningHasBenefit(l, partScan.get, r, right)
>   newLeft = insertPredicate(l, newLeft, r, right, rightKeys, 
> hasBenefit)
> } else {
>   partScan = getPartitionTableScan(r, right)
>   if (partScan.isDefined && canPruneRight(joinType) &&
>   hasPartitionPruningFilter(left) ) {
> val hasBenefit = pruningHasBenefit(r, partScan.get, l, left)
> newRight = insertPredicate(r, newRight, l, left, leftKeys, 
> hasBenefit)
>   }
> }
>   case _ =>
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30100:
-
Component/s: SQL

> Decimal Precision Inferred from JDBC via Spark
> --
>
> Key: SPARK-30100
> URL: https://issues.apache.org/jira/browse/SPARK-30100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
>
> When trying to load data from JDBC (Oracle) into Spark, there seems to be 
> precision loss in the decimal field; as per my understanding Spark supports 
> *DECIMAL(38,18)*. The field from Oracle is DECIMAL(38,14), whereas Spark 
> rounds off the last four digits, making it a precision of DECIMAL(38,10). This 
> is happening to a few fields in the dataframe where the column is fetched using 
> a CASE statement, whereas in the same query another field gets the right 
> schema.
> Tried to pass the
> {code:java}
> spark.sql.decimalOperations.allowPrecisionLoss=false{code}
> conf in spark-submit, though it didn't give the desired results.
> {code:java}
> jdbcDF = spark.read \ 
> .format("jdbc") \ 
> .option("url", "ORACLE") \ 
> .option("dbtable", "QUERY") \ 
> .option("user", "USERNAME") \ 
> .option("password", "PASSWORD") \ 
> .load(){code}
> So, considering that Spark infers the schema from sample records, how 
> does this work here? Does it use the results of the query, i.e. (SELECT * FROM 
> TABLE_NAME JOIN ...), or does it take a different route to guess the schema 
> for itself? Can someone shed some light on this and advise how to achieve 
> the right decimal precision in this regard without manipulating the query? 
> Doing a CAST in the query does solve the issue, but I would prefer to get some 
> alternatives.
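
On the CAST alternative mentioned at the end, the JDBC reader also has a {{customSchema}} option that pins column types on the Spark side without changing the query; a hedged sketch (URL, credentials, table and column names are illustrative, and an active SparkSession `spark` is assumed):

{code:python}
# Hedged sketch: pin decimal precision on the Spark side via the JDBC reader's
# customSchema option instead of CASTing inside the query. URL, credentials,
# table and column names are illustrative.
jdbcDF = (spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db.example.com:1521/ORCL")
          .option("dbtable", "SOURCE_TABLE")
          .option("user", "USERNAME")
          .option("password", "PASSWORD")
          .option("customSchema", "AMOUNT DECIMAL(38,14), REGION STRING")
          .load())
{code}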



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30096) drop function does not delete the jars file from tmp folder- session is not getting clear

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30096:
-
Description: 
Steps:
 1. {{spark-master --master yarn}}
 2. {{spark-sql> create function addDoubles AS 
'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 
'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}}
 3. {{spark-sql> select addDoubles (1,2);}}

In the console log user can see
 Added [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] 
to class path
{code}
 19/12/02 13:49:33 INFO SessionState: Added 
[/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
class path
{code}


4. {{spark-sql>drop function AddDoubles;}}
 Check the tmp folder still session is not clear
{code:java}
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll
total 11696
-rw-r--r-- 1 root root92660 Dec  2 13:49 .AddDoublesUDF.jar.crc
-rwxr-xr-x 1 root root 11859263 Dec  2 13:49 AddDoublesUDF.jar
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources #
{code}

  was:
Steps:
1. {{spark-master --master yarn}}
2. {{spark-sql> create function addDoubles  AS 
'com.huawei.bigdata.hive.example.udf.AddDoublesUDF'  using jar 
'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}}
3. {{spark-sql> select addDoubles (1,2);}}

In the console log user can see
Added [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] 
to class path
19/12/02 13:49:33 INFO SessionState: Added 
[/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
class path

4. {{spark-sql>drop function AddDoubles;}}
Check the tmp folder still session is not clear

{code}
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll
total 11696
-rw-r--r-- 1 root root92660 Dec  2 13:49 .AddDoublesUDF.jar.crc
-rwxr-xr-x 1 root root 11859263 Dec  2 13:49 AddDoublesUDF.jar
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources #
{code}




> drop function does not delete the jars file from tmp folder- session is not 
> getting clear
> -
>
> Key: SPARK-30096
> URL: https://issues.apache.org/jira/browse/SPARK-30096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
>  1. {{spark-master --master yarn}}
>  2. {{spark-sql> create function addDoubles AS 
> 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 
> 'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}}
>  3. {{spark-sql> select addDoubles (1,2);}}
> In the console log user can see
>  Added 
> [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
> class path
> {code}
>  19/12/02 13:49:33 INFO SessionState: Added 
> [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
> class path
> {code}
> 4. {{spark-sql>drop function AddDoubles;}}
>  Check the tmp folder still session is not clear
> {code:java}
> vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll
> total 11696
> -rw-r--r-- 1 root root92660 Dec  2 13:49 .AddDoublesUDF.jar.crc
> -rwxr-xr-x 1 root root 11859263 Dec  2 13:49 AddDoublesUDF.jar
> vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources #
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30096) drop function does not delete the jars file from tmp folder- session is not getting clear

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30096:
-
Description: 
Steps:
1. {{spark-master --master yarn}}
2. {{spark-sql> create function addDoubles  AS 
'com.huawei.bigdata.hive.example.udf.AddDoublesUDF'  using jar 
'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}}
3. {{spark-sql> select addDoubles (1,2);}}

In the console log user can see
Added [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] 
to class path
19/12/02 13:49:33 INFO SessionState: Added 
[/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
class path

4. {{spark-sql>drop function AddDoubles;}}
Check the tmp folder still session is not clear

{code}
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll
total 11696
-rw-r--r-- 1 root root92660 Dec  2 13:49 .AddDoublesUDF.jar.crc
-rwxr-xr-x 1 root root 11859263 Dec  2 13:49 AddDoublesUDF.jar
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources #
{code}



  was:
Steps:
1.spark-master --master yarn
2. spark-sql> create function addDoubles  AS 
'com.huawei.bigdata.hive.example.udf.AddDoublesUDF'  using jar 
'hdfs://hacluster/user/user1/AddDoublesUDF.jar';
3. spark-sql> select addDoubles (1,2);

In the console log user can see
Added [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] 
to class path
19/12/02 13:49:33 INFO SessionState: Added 
[/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
class path

4. spark-sql>drop function AddDoubles;
Check the tmp folder still session is not clear

vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll
total 11696
-rw-r--r-- 1 root root92660 Dec  2 13:49 .AddDoublesUDF.jar.crc
-rwxr-xr-x 1 root root 11859263 Dec  2 13:49 AddDoublesUDF.jar
vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources #




> drop function does not delete the jars file from tmp folder- session is not 
> getting clear
> -
>
> Key: SPARK-30096
> URL: https://issues.apache.org/jira/browse/SPARK-30096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
> 1. {{spark-master --master yarn}}
> 2. {{spark-sql> create function addDoubles  AS 
> 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF'  using jar 
> 'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}}
> 3. {{spark-sql> select addDoubles (1,2);}}
> In the console log user can see
> Added [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] 
> to class path
> 19/12/02 13:49:33 INFO SessionState: Added 
> [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to 
> class path
> 4. {{spark-sql>drop function AddDoubles;}}
> Check the tmp folder still session is not clear
> {code}
> vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll
> total 11696
> -rw-r--r-- 1 root root92660 Dec  2 13:49 .AddDoublesUDF.jar.crc
> -rwxr-xr-x 1 root root 11859263 Dec  2 13:49 AddDoublesUDF.jar
> vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources #
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30149) Schema Definition Spark Read

2019-12-11 Thread Joby Joje (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994226#comment-16994226
 ] 

Joby Joje commented on SPARK-30149:
---

[~hyukjin.kwon] Sorry, I didn't know Critical was for committers; I will see to it 
that it's not repeated. With respect to testing it in Spark 3, we are using 
the latest stable Spark 2.4 version, and moving to the preview version wouldn't 
be a great idea until a stable Spark 3 release is out. Though I can 
give it a try locally just to confirm.

> Schema Definition Spark Read
> 
>
> Key: SPARK-30149
> URL: https://issues.apache.org/jira/browse/SPARK-30149
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
> Attachments: Output.txt, Schema.py, Test.csv, Test1.csv
>
>
> Reading a CSV file with a defined schema, I am able to load the files and do the 
> processing, which works fine using the code below. The schema is defined to 
> strictly follow the data types so that precision is recorded accurately.
> {code:java}
> source_schema = StructType([StructField("NAME", StringType(), True), 
> StructField("AGE", StringType(), True), StructField("GENDER", StringType(), True), 
> StructField("PROFESSION", StringType(), True), StructField("SALARY", DecimalType(38, 14), True), 
> StructField("BAD_RECORD", StringType(), True)]){code}
> {code:java}
> df_raw_file = sparksession.read \ 
> .format("csv") \ 
> .option("delimiter", '\t') \ 
> .option("header", "false") \ 
> .option("inferSchema", "true") \ 
> .option("columnNameOfCorruptRecord", "BAD_RECORD") \ 
> .schema(source_schema) \ .load(in_file_list) \ 
> .withColumn("LINE_NUMBER", monotonically_increasing_id()) \ 
> .withColumn("SOURCE_FILE_NAME", input_file_name()){code}
>  As per the Spark Documentation the mode is {{PERMISSIVE}} by default is its 
> not set and A record with less/more tokens than schema is not a corrupted 
> record to CSV. When it meets a record having fewer tokens than the length of 
> the schema, sets {{null}} to extra fields. When the record has more tokens 
> than the length of the schema, it drops extra tokens.
> {code:java}
> FILE SCHEMA TEST.CSVFILE SCHEMA TEST.CSV
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true) 
> |-- BAD_RECORD: string (nullable = true) 
> |-- LINE_NUMBER: long (nullable = false) 
> |-- SOURCE_FILE_NAME: string (nullable = false)
> OUTPUT THE FILE TEST.CSV
> +--++--+--+--+-+---+-+
> |NAME  |AGE |GENDER|PROFESSION|SALARY|BAD_RECORD  
>  |LINE_NUMBER|SOURCE_FILE_NAME |
> +--++--+--+--+-+---+-+
> |null  |null|null  |null  |null  |NAMEAGE GENDER  
> PROFESSION  SALARY|0  |Test.CSV|
> |JOHN  |27  |MALE  |CEO   |300.1231423450|null
>  |1  |Test.CSV|
> |JUSTIN|67  |MALE  |CTO   |123.2345354345|null
>  |2  |Test.CSV|
> |SARAH |45  |FEMALE|CS|null  |null
>  |3  |Test.CSV|
> |SEAN  |66  |MALE  |CA|null  |SEAN66  MALE
> CA  |4  |Test.CSV|
> |PHIL  |34  |MALE  |null  |234.986986|null
>  |5  |Test.CSV|
> |null  |null|null  |null  |null  |JILL25  
> BOARD|6  |Test.CSV|
> |JACK  |30  |MALE  |BOARD |null  |JACK30  MALE
> BOARD   |7  |Test.CSV|
> +--++--+--+--+-+---+-+{code}
>  
> The TEST1.CSV does not have the SALARY column, so Spark should have set that 
> column to NULL and the BAD_RECORD column should be NULL for those rows; that does 
> not seem to happen and the values are considered CORRUPT. 
> Also, when it meets a corrupted record it should put the malformed string into the field 
> configured by {{columnNameOfCorruptRecord}} and set the other fields to 
> {{null}}; this is also not happening, and I see it happening only for the 
> JILL row.
> {code:java}
> FILE SCHEMA TEST1.CSV 
> root 
> |-- NAME: string (nullable = true) 
> |-- AGE: string (nullable = true) 
> |-- GENDER: string (nullable = true) 
> |-- PROFESSION: string (nullable = true) 
> |-- SALARY: decimal(38,14) (nullable = true)

[jira] [Resolved] (SPARK-30228) Update zstd-jni to 1.4.4-3

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30228.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26856
[https://github.com/apache/spark/pull/26856]

> Update zstd-jni to 1.4.4-3
> --
>
> Key: SPARK-30228
> URL: https://issues.apache.org/jira/browse/SPARK-30228
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30228) Update zstd-jni to 1.4.4-3

2019-12-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30228:


Assignee: Dongjoon Hyun

> Update zstd-jni to 1.4.4-3
> --
>
> Key: SPARK-30228
> URL: https://issues.apache.org/jira/browse/SPARK-30228
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30229) java.lang.NullPointerException at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)

2019-12-11 Thread SeaAndHill (Jira)
SeaAndHill created SPARK-30229:
--

 Summary: java.lang.NullPointerException at 
org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)
 Key: SPARK-30229
 URL: https://issues.apache.org/jira/browse/SPARK-30229
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: SeaAndHill


2019-12-12 11:52:00 INFO JobScheduler:54 - Added jobs for time 157612272 ms
2019-12-12 11:52:00 INFO JobScheduler:54 - Starting job streaming job 
157612272 ms.0 from job set of time 157612272 ms
2019-12-12 11:52:00 INFO CarbonSparkSqlParser:54 - Parsing command: 
event_detail_temp
2019-12-12 11:52:00 INFO CarbonLateDecodeRule:95 - skip CarbonOptimizer
2019-12-12 11:52:00 INFO CarbonLateDecodeRule:72 - Skip CarbonOptimizer
2019-12-12 11:52:00 INFO CarbonLateDecodeRule:95 - skip CarbonOptimizer
2019-12-12 11:52:00 INFO CarbonLateDecodeRule:72 - Skip CarbonOptimizer
2019-12-12 11:52:00 INFO JobScheduler:54 - Finished job streaming job 
157612272 ms.0 from job set of time 157612272 ms
2019-12-12 11:52:00 ERROR JobScheduler:91 - Error running job streaming job 
157612272 ms.0
java.lang.NullPointerException
 at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)
 at 
org.apache.spark.rdd.DefaultPartitionCoalescer.currPrefLocs(CoalescedRDD.scala:178)
 at 
org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations$$anonfun$getAllPrefLocs$2.apply(CoalescedRDD.scala:196)
 at 
org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations$$anonfun$getAllPrefLocs$2.apply(CoalescedRDD.scala:195)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at 
org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.getAllPrefLocs(CoalescedRDD.scala:195)
 at 
org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.<init>(CoalescedRDD.scala:188)
 at 
org.apache.spark.rdd.DefaultPartitionCoalescer.coalesce(CoalescedRDD.scala:391)
 at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:91)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
 at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
 at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
 at 
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2399)
 at 
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2398)
 at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
 at 
org.apache.spark.sql.execution.SQLExe
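
For context, the trace runs from a streaming batch (JobScheduler) through
Dataset.collectAsList and RDD.collect into DefaultPartitionCoalescer, which
calls SparkContext.getPreferredLocs while computing partitions for a
CoalescedRDD. The sketch below is a hypothetical reconstruction of that call
pattern, not the reporter's code; the stream source, query, and partition count
are placeholders, and the CarbonData pieces visible in the log are omitted. An
NPE at that point usually means the DAG scheduler was already null, for
example because the SparkContext was stopped while a batch was still running,
which may be an avenue worth checking.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder().appName("spark-30229-sketch").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

// Placeholder source; the report does not show how the input stream is built.
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  import spark.implicits._
  // Register each batch as the temp view seen in the log (event_detail_temp).
  rdd.toDF("value").createOrReplaceTempView("event_detail_temp")

  // coalesce + collectAsList mirrors the CoalescedRDD and Dataset.collectAsList
  // frames in the stack trace; the query and partition count are placeholders.
  val rows = spark.sql("SELECT * FROM event_detail_temp")
    .coalesce(2)
    .collectAsList()
  println(s"batch collected ${rows.size()} rows")
}

ssc.start()
ssc.awaitTermination()
{code}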
