[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page

2020-04-10 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081188#comment-17081188
 ] 

Kousuke Saruta commented on SPARK-31420:


OK, I'll look into this.

> Infinite timeline redraw in job details page
> 
>
> Key: SPARK-31420
> URL: https://issues.apache.org/jira/browse/SPARK-31420
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Kousuke Saruta
>Priority: Major
> Attachments: timeline.mov
>
>
> In the job page, the timeline section keeps changing the position style and 
> shaking. We can see that there is a warning "infinite loop in redraw" from 
> the console, which can be related to 
> https://github.com/visjs/vis-timeline/issues/17
> I am using the history server with the events under 
> "core/src/test/resources/spark-events" to reproduce.
> I have also uploaded a screen recording.






[jira] [Created] (SPARK-31422) Fix NPE when BlockManagerSource is used after BlockManagerMaster stops

2020-04-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31422:
-

 Summary: Fix NPE when BlockManagerSource is used after 
BlockManagerMaster stops
 Key: SPARK-31422
 URL: https://issues.apache.org/jira/browse/SPARK-31422
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5, 2.3.4, 3.0.0
Reporter: Dongjoon Hyun


In `SparkEnv.stop`, the following stop sequence is used.
{code}
blockManager.stop()
blockManager.master.stop()
metricsSystem.stop()
{code}

During `metricsSystem.stop`, a sink can still invoke `BlockManagerSource` and 
end up with an NPE.
{code}
sinks.foreach(_.stop)
registry.removeMatching((_: String, _: Metric) => true)
{code}

{code}
java.lang.NullPointerException
at 
org.apache.spark.storage.BlockManagerMaster.getStorageStatus(BlockManagerMaster.scala:170)
at 
org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63)
at 
org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63)
at 
org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:31)
at 
org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:30)
{code}

Since SparkContext registers `BlockManagerSource` and never deregisters it, we 
had better guard against the NPE inside `BlockManagerMaster`.
{code}
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
{code}
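A minimal sketch of the kind of guard this suggests, assuming the `driverEndpoint` field and `GetStorageStatus` message that appear in the stack trace above (the actual fix may differ):
{code:scala}
// Sketch inside BlockManagerMaster: stop() nulls out driverEndpoint, so a metrics
// Gauge that fires afterwards would otherwise hit the NPE shown above.
def getStorageStatus: Array[StorageStatus] = {
  if (driverEndpoint == null) {
    // The master is already stopped; report nothing instead of throwing.
    Array.empty
  } else {
    driverEndpoint.askSync[Array[StorageStatus]](GetStorageStatus)
  }
}
{code}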







[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2020-04-10 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081174#comment-17081174
 ] 

Fokko Driesprong commented on SPARK-29245:
--

Thanks. For reference, we have a similar issue on Iceberg: 
[https://github.com/apache/incubator-iceberg/pull/577]

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> With a `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic operations are verified:
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE






[jira] [Created] (SPARK-31421) transformAllExpressions can not transform expressions in ScalarSubquery

2020-04-10 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-31421:
---

 Summary: transformAllExpressions can not transform expressions in 
ScalarSubquery
 Key: SPARK-31421
 URL: https://issues.apache.org/jira/browse/SPARK-31421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


Example:
{code:scala}
spark.sql("CREATE TABLE t1 (c1 bigint, c2 bigint)")
val sqlStr = "SELECT COUNT(*) FROM t1 HAVING SUM(c1) > (SELECT AVG(c2) FROM t1)"
println(normalizeExprIds(spark.sql(sqlStr).queryExecution.optimizedPlan).treeString)
{code}
The output:
{noformat}
Project [count(1)#0L]
+- Filter (isnotnull(sum(c1#2L)#0L) AND (cast(sum(c1#2L)#0L as double) > 
scalar-subquery#0 []))
   :  +- Aggregate [avg(c2#3L) AS avg(c2)#6]
   : +- Project [c2#3L]
   :+- Relation[c1#2L,c2#3L] parquet
   +- Aggregate [count(1) AS count(1)#0L, sum(c1#0L) AS sum(c1#2L)#0L]
  +- Project [c1#0L]
 +- Relation[c1#0L,c2#0L] parquet
{noformat}

Another example query:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q14b.sql
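Note that the expression IDs inside the scalar subquery (e.g. `avg(c2)#6`) are left un-normalized, because `transformAllExpressions` does not descend into the plan held by the `ScalarSubquery` expression. A rough workaround sketch (hypothetical helper, not an existing Spark API) that recurses into scalar subqueries explicitly:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, ScalarSubquery}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Apply a rewrite rule everywhere, including inside the plans held by
// ScalarSubquery expressions, which plain transformAllExpressions skips.
def transformAllExpressionsWithSubqueries(plan: LogicalPlan)(
    rule: PartialFunction[Expression, Expression]): LogicalPlan = {
  plan transformAllExpressions {
    case s: ScalarSubquery =>
      // Recurse into the subquery's own logical plan first ...
      val rewritten = s.copy(plan = transformAllExpressionsWithSubqueries(s.plan)(rule))
      // ... then give the rule a chance to rewrite the subquery expression itself.
      if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
    case e if rule.isDefinedAt(e) => rule(e)
  }
}
{code}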








[jira] [Issue Comment Deleted] (SPARK-31256) Dropna doesn't work for struct columns

2020-04-10 Thread JinxinTang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JinxinTang updated SPARK-31256:
---
Comment: was deleted

(was: I will try to solve it)

> Dropna doesn't work for struct columns
> --
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
>Reporter: Michael Souder
>Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, 
> None)], schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with Python 3.7.4 and in Spark 2.3.1 
> with Python 3.6.8, and in both cases the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}






[jira] [Commented] (SPARK-31256) Dropna doesn't work for struct columns

2020-04-10 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081084#comment-17081084
 ] 

JinxinTang commented on SPARK-31256:


I will try to solve it

> Dropna doesn't work for struct columns
> --
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
>Reporter: Michael Souder
>Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, 
> None)], schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with Python 3.7.4 and in Spark 2.3.1 
> with Python 3.6.8, and in both cases the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}






[jira] [Resolved] (SPARK-25154) Support NOT IN sub-queries inside nested OR conditions.

2020-04-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-25154.
--
Fix Version/s: 3.1.0
 Assignee: Dilip Biswal
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28158

> Support NOT IN sub-queries inside nested OR conditions.
> ---
>
> Key: SPARK-25154
> URL: https://issues.apache.org/jira/browse/SPARK-25154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, NOT IN subqueries (null-aware predicate subqueries) are not allowed 
> inside OR expressions. We catch this condition in checkAnalysis and throw an 
> error. 
> We need to extend our subquery rewrite framework to support this type of 
> query.
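For illustration, a query shape that previously failed analysis and is now supported looks roughly like this (a sketch; the tables `t1(a, b)` and `t2(c)` are made up):
{code:scala}
// NOT IN (a null-aware predicate subquery) nested under an OR condition.
spark.sql("""
  SELECT *
  FROM t1
  WHERE t1.a > 5
     OR t1.b NOT IN (SELECT c FROM t2)
""")
{code}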






[jira] [Updated] (SPARK-25154) Support NOT IN sub-queries inside nested OR conditions.

2020-04-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-25154:
-
Issue Type: Improvement  (was: Bug)

> Support NOT IN sub-queries inside nested OR conditions.
> ---
>
> Key: SPARK-25154
> URL: https://issues.apache.org/jira/browse/SPARK-25154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, NOT IN subqueries (null-aware predicate subqueries) are not allowed 
> inside OR expressions. We catch this condition in checkAnalysis and throw an 
> error. 
> We need to extend our subquery rewrite framework to support this type of 
> query.






[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page

2020-04-10 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081017#comment-17081017
 ] 

Gengliang Wang commented on SPARK-31420:


[~sarutak] I looked into it but I am not familiar with this part. Could you 
please check it?

> Infinite timeline redraw in job details page
> 
>
> Key: SPARK-31420
> URL: https://issues.apache.org/jira/browse/SPARK-31420
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Kousuke Saruta
>Priority: Major
> Attachments: timeline.mov
>
>
> In the job page, the timeline section keeps changing the position style and 
> shaking. We can see that there is a warning "infinite loop in redraw" from 
> the console, which can be related to 
> https://github.com/visjs/vis-timeline/issues/17
> I am using the history server with the events under 
> "core/src/test/resources/spark-events" to reproduce.
> I have also uploaded a screen recording.






[jira] [Created] (SPARK-31420) Infinite timeline redraw in job details page

2020-04-10 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-31420:
--

 Summary: Infinite timeline redraw in job details page
 Key: SPARK-31420
 URL: https://issues.apache.org/jira/browse/SPARK-31420
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0, 3.1.0
Reporter: Gengliang Wang
Assignee: Kousuke Saruta
 Attachments: timeline.mov

In the job page, the timeline section keeps changing the position style and 
shaking. We can see that there is a warning "infinite loop in redraw" from the 
console, which can be related to https://github.com/visjs/vis-timeline/issues/17

I am using the history server with the events under 
"core/src/test/resources/spark-events" to reproduce.
I have also uploaded a screen recording.






[jira] [Updated] (SPARK-31420) Infinite timeline redraw in job details page

2020-04-10 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-31420:
---
Attachment: timeline.mov

> Infinite timeline redraw in job details page
> 
>
> Key: SPARK-31420
> URL: https://issues.apache.org/jira/browse/SPARK-31420
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Kousuke Saruta
>Priority: Major
> Attachments: timeline.mov
>
>
> In the job page, the timeline section keeps changing the position style and 
> shaking. We can see that there is a warning "infinite loop in redraw" from 
> the console, which can be related to 
> https://github.com/visjs/vis-timeline/issues/17
> I am using the history server with the events under 
> "core/src/test/resources/spark-events" to reproduce.
> I have also uploaded a screen recording.






[jira] [Created] (SPARK-31419) Document Table-valued Function and Inline Table

2020-04-10 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31419:
--

 Summary: Document Table-valued Function and Inline Table
 Key: SPARK-31419
 URL: https://issues.apache.org/jira/browse/SPARK-31419
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document Table-valued Function and Inline Table






[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like some help to explore whether the following feature makes sense and 
what the best way to implement it would be: allow users to use "isin" (or 
similar) predicates with huge Lists (which might be broadcast).*

As of now (AFAIK), users can only provide a list/sequence of elements, which is 
sent to the executors as part of the "In" predicate, which in turn is part of 
the task. So when this list is huge, the tasks become "huge" too (especially 
when the number of tasks is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also of the 2nd answer). 

I know this is not common, but it can be useful for people reading from 
"servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my 
concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes 
(+1 minute "calculating the plan" to send the BIG tasks to the executors) versus 
30 minutes doing a full scan of my table (even if semi-filtered; most of the 
data has the same nature and thus is hard to filter unless I use this 
200,000-element list).

More or less I have a clear view (explained on Stack Overflow) of what to modify 
if I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn".

I've looked at the Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure whether there's an alternative there (I'm stuck on 2.4 anyway, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), but I'm not an expert 
in Scala syntax/syntactic sugar. Without creating a new predicate, *does anyone 
have an idea how to extend the current case class "In" while keeping 
compatibility with PushDown implementations, i.e. not having only 
Seq[Expression] in the case class but also allowing Broadcast[Seq[Expression]]?* 
This is what makes the tasks huge; I made a test run with a predicate which 
receives a Broadcast variable and uses the value inside, and it works much 
better.

  was:
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered, most of the data has the same 
nature and thus is hard to filter, unless I use this "200.000" list) on my 
table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0

[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered, most of the data has the same 
nature and thus is hard to filter, unless I use this "200.000" list) on my 
table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations (aka creating a new Predicate not allowed I think), but not 
having only Seq[Expression] in the case class, but also allow 
Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a 
test-run with a predicate which receives and Broadcast variable and uses the 
value inside, and it works much better.

  was:
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered, most of the data has the same 
nature and thus is hard to filter, unless I use this "200.000" list) on my 
table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, without creating a new predicate 
*anyone has any idea on how to extend current case class "In" while keeping the 
compatibility with PushDown implementations, but not having only 
Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* 
This is what makes the tasks huge, I made a test-run with a predicate which 
receives and Broadcast variable and uses the value inside, and it works much 
better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  

[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered, most of the data has the same 
nature and thus is hard to filter, unless I use this "200.000" list) on my 
table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.

  was:
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered, most of the data has the same 
nature and thus is hard to filter, unless if I use this "200.000" list) on my 
table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florenti

[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered, most of the data has the same 
nature and thus is hard to filter, unless if I use this "200.000" list) on my 
table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.

  was:
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florentino Sainz
>Priority: Minor
>
> *I would like some help to explore if the following feature

[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like some help to explore if the following feature makes sense and 
what's the best way to implement it. Allow users to use "isin" (or similar) 
predicates with huge Lists (which might be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.

  was:
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with huge Lists (which might 
be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florentino Sainz
>Priority: Minor
>
> *I would like some help to explore if the following feature makes sense and 
> what's the best way to implement it. Allow users to use "isin" (or similar) 
> predicates with huge Lists (which might 

[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with huge Lists (which might 
be broadcasted).*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.

  was:
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florentino Sainz
>Priority: Minor
>
> *I would like to for someone to explore if the following feature makes sense. 
> Allow users to use "isin" (or similar) predicates with huge Lists (which 
> might be broadcasted).*
> As of now (AFAIK), users can only provide a list/sequence of e

[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-04-10 Thread Venkata krishnan Sowrirajan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080942#comment-17080942
 ] 

Venkata krishnan Sowrirajan commented on SPARK-31418:
-

Currently, I'm working on this issue.

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks whether 
> there is an idle blacklisted executor that can be killed and replaced; if 
> not, it aborts the job without doing the maximum number of retries.
> In the context of dynamic allocation this could be handled better: instead of 
> killing an idle blacklisted executor (it's possible there is none), request 
> an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below (the 
> example should fail eventually; the point is that it is not retried 
> spark.task.maxFailures times):
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.
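A minimal reproduction sketch of the setup described above, using standard configuration keys (assumes a cluster manager where dynamic allocation is available, which normally also needs the external shuffle service):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Blacklisting plus dynamic allocation with a single minimum executor.
val conf = new SparkConf()
  .setAppName("blacklist-dynamic-allocation-repro")
  .set("spark.blacklist.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.task.maxFailures", "4")
val sc = new SparkContext(conf)

// Every task throws ClassCastException, so it should be retried
// spark.task.maxFailures times; with blacklisting the job can instead
// abort early once the executor gets blacklisted for the task.
def test(a: Int) = { a.asInstanceOf[String] }
sc.parallelize(1 to 10, 10).map(x => test(x)).collect()
{code}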






[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having the List as a val in the case class?* This is 
what makes the tasks huge, I made a test-run with a predicate which receives 
and Broadcast variable and uses the value inside, and it works much better.

  was:
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any clue If it's possible 
to extend current case class "In" while keeping the compatibility with PushDown 
implementations, but not having the List as a val in the case class?* This is 
what makes the tasks huge, I made a test-run with a predicate which receives 
and Broadcast variable and uses the value inside, and it works much better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florentino Sainz
>Priority: Minor
>
> *I would like to for someone to explore if the following feature makes sense. 
> Allow users to use "isin" (or similar) predicates with Lists which are 
> broadcasted.*
> As of now (AFAIK), users can only provide a list/sequence of elements which 
> will be sent as part of the "In" predicate, which is part of the task, to the 
> executors.

[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.

  was:
*I would like someone to explore whether the following feature makes sense: 
allow users to use "isin" (or similar) predicates with Lists that are 
broadcast.*

As of now (AFAIK), users can only provide a list/sequence of elements that is 
sent to the executors as part of the "In" predicate, which is embedded in the 
task. So when this list is huge, the tasks become huge as well (especially 
when the number of tasks is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but it can be useful for people reading from 
"servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my 
concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes 
(+1 minute "calculating the plan" to send the BIG tasks to the executors) 
versus 30 minutes doing a full scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in the stackoverflow post) of what 
to modify if I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
however with my limited "idea", anyone who has implemented In pushdown would 
have to re-adapt it to the new "BroadcastIn".

I've looked at the Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure whether there's an alternative there (I'm stuck on 2.4 anyway, 
working for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time); however, I'm not an 
expert in Scala syntax/syntactic sugar. *Does anyone have any clue whether it's 
possible to extend the current case class "In" while keeping compatibility with 
the PushDown implementations, but without having the List as a val in the case 
class?* This is what makes the tasks huge; I made a test run with a predicate 
that receives a Broadcast variable and uses the value inside, and it works much 
better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florentino Sainz
>Priority: Minor
>
> *I would like someone to explore whether the following feature makes sense: 
> allow users to use "isin" (or similar) predicates with Lists that are 
> broadcast.*
> As of now (AFAIK), users can only provide a list/sequence of elements which 
> will be sent as part of the "In" predicate, which is p

[jira] [Updated] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-04-10 Thread Venkata krishnan Sowrirajan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata krishnan Sowrirajan updated SPARK-31418:

Description: 
With Spark blacklisting, if a task fails on an executor, the executor gets 
blacklisted for that task. In order to retry the task, Spark checks whether 
there is an idle blacklisted executor that can be killed and replaced; if there 
is not, it aborts the job without doing the maximum number of retries.

In the context of dynamic allocation this could be handled better: instead of 
killing a blacklisted idle executor (it's possible that there is no idle 
blacklisted executor at all), request an additional executor and retry the 
task.

This can be easily reproduced with a simple job like the one below (the example 
is expected to fail eventually either way; the point is that the task is not 
retried spark.task.maxFailures times): 

{code:java}
def test(a: Int) = { a.asInstanceOf[String] }
sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
{code}

with dynamic allocation enabled and min executors set to 1. But there are 
various other cases where this can fail as well.

  was:
With Spark blacklisting, if a task fails on an executor, the executor gets 
blacklisted for that task. In order to retry the task, Spark checks whether 
there is an idle blacklisted executor that can be killed and replaced; if there 
is not, it aborts the job without doing the maximum number of retries.

In the context of dynamic allocation this could be handled better: instead of 
killing a blacklisted idle executor (it's possible that there is no idle 
blacklisted executor at all), request an additional executor and retry the 
task.

This can be easily reproduced with a simple job like the one below: 

{code:java}
def test(a: Int) = { a.asInstanceOf[String] }
sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
{code}

with dynamic allocation enabled and min executors set to 1. But there are 
various other cases where this can fail as well.


> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for that task. In order to retry the task, Spark checks whether 
> there is an idle blacklisted executor that can be killed and replaced; if 
> there is not, it aborts the job without doing the maximum number of retries.
> In the context of dynamic allocation this could be handled better: instead of 
> killing a blacklisted idle executor (it's possible that there is no idle 
> blacklisted executor at all), request an additional executor and retry the 
> task.
> This can be easily reproduced with a simple job like the one below (the 
> example is expected to fail eventually either way; the point is that the task 
> is not retried spark.task.maxFailures times): 
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.
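
For reference, a minimal sketch of the configuration under which the 
reproduction above applies (the property names are the standard Spark settings; 
the values are only illustrative):

{code}
import org.apache.spark.SparkConf

// dynamic allocation with a single minimum executor, plus blacklisting
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .set("spark.blacklist.enabled", "true")
  .set("spark.task.maxFailures", "4")
{code}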



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-
Description: 
*I would like someone to explore whether the following feature makes sense: 
allow users to use "isin" (or similar) predicates with Lists that are 
broadcast.*

As of now (AFAIK), users can only provide a list/sequence of elements that is 
sent to the executors as part of the "In" predicate, which is embedded in the 
task. So when this list is huge, the tasks become huge as well (especially 
when the number of tasks is big).

I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but it can be useful for people reading from 
"servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my 
concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes 
(+1 minute "calculating the plan" to send the BIG tasks to the executors) 
versus 30 minutes doing a full scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in the stackoverflow post) of what 
to modify if I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
however with my limited "idea", anyone who has implemented In pushdown would 
have to re-adapt it to the new "BroadcastIn".

I've looked at the Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure whether there's an alternative there (I'm stuck on 2.4 anyway, 
working for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time); however, I'm not an 
expert in Scala syntax/syntactic sugar. *Does anyone have any clue whether it's 
possible to extend the current case class "In" while keeping compatibility with 
the PushDown implementations, but without having the List as a val in the case 
class?* This is what makes the tasks huge; I made a test run with a predicate 
that receives a Broadcast variable and uses the value inside, and it works much 
better.

  was:
*I would like someone to explore whether the following feature makes sense: 
allow users to use "isin" (or similar) predicates with Lists that are 
broadcast.*


I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but it can be useful for people reading from 
"servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my 
concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes 
(+1 minute "calculating the plan" to send the BIG tasks to the executors) 
versus 30 minutes doing a full scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in the stackoverflow post) of what 
to modify if I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
however with my limited "idea", anyone who has implemented In pushdown would 
have to re-adapt it to the new "BroadcastIn".

I've looked at the Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure whether there's an alternative there (I'm stuck on 2.4 anyway, 
working for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time); however, I'm not an 
expert in Scala syntax/syntactic sugar. *Does anyone have any clue whether it's 
possible to extend the current case class "In" while keeping compatibility with 
the PushDown implementations, but without having the List as a val in the case 
class?* This is what makes the tasks huge; I made a test run with a predicate 
that receives a Broadcast variable and uses the value inside, and it works much 
better.


> Allow broadcast variables when using isin code
> --
>
> Key: SPARK-31417
> URL: https://issues.apache.org/jira/browse/SPARK-31417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Florentino Sainz
>Priority: Minor
>
> *I would like someone to explore whether the following feature makes sense: 
> allow users to use "isin" (or similar) predicates with Lists that are 
> broadcast.*
> As of now (AFAIK), users can only provide a list/sequence of elements that is 
> sent to the executors as part of the "In" predicate, which is embedded in the 
> task. So when this list is huge, the tasks become huge as well (especially 
> when the number of tasks is big).
> I'm coming from 
> https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
> (I'm the creator of the post, and also the 2nd answer

[jira] [Created] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-04-10 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-31418:
---

 Summary: Blacklisting feature aborts Spark job without retrying 
for max num retries in case of Dynamic allocation
 Key: SPARK-31418
 URL: https://issues.apache.org/jira/browse/SPARK-31418
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5, 2.3.0
Reporter: Venkata krishnan Sowrirajan


With Spark blacklisting, if a task fails on an executor, the executor gets 
blacklisted for that task. In order to retry the task, Spark checks whether 
there is an idle blacklisted executor that can be killed and replaced; if there 
is not, it aborts the job without doing the maximum number of retries.

In the context of dynamic allocation this could be handled better: instead of 
killing a blacklisted idle executor (it's possible that there is no idle 
blacklisted executor at all), request an additional executor and retry the 
task.

This can be easily reproduced with a simple job like the one below: 

{code:java}
def test(a: Int) = { a.asInstanceOf[String] }
sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
{code}

with dynamic allocation enabled and min executors set to 1. But there are 
various other cases where this can fail as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31417) Allow broadcast variables when using isin code

2020-04-10 Thread Florentino Sainz (Jira)
Florentino Sainz created SPARK-31417:


 Summary: Allow broadcast variables when using isin code
 Key: SPARK-31417
 URL: https://issues.apache.org/jira/browse/SPARK-31417
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: Florentino Sainz


*I would like someone to explore whether the following feature makes sense: 
allow users to use "isin" (or similar) predicates with Lists that are 
broadcast.*


I'm coming from 
https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but it can be useful for people reading from 
"servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my 
concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes 
(+1 minute "calculating the plan" to send the BIG tasks to the executors) 
versus 30 minutes doing a full scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in the stackoverflow post) of what 
to modify if I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
however with my limited "idea", anyone who has implemented In pushdown would 
have to re-adapt it to the new "BroadcastIn".

I've looked at the Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure whether there's an alternative there (I'm stuck on 2.4 anyway, 
working for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time); however, I'm not an 
expert in Scala syntax/syntactic sugar. *Does anyone have any clue whether it's 
possible to extend the current case class "In" while keeping compatibility with 
the PushDown implementations, but without having the List as a val in the case 
class?* This is what makes the tasks huge; I made a test run with a predicate 
that receives a Broadcast variable and uses the value inside, and it works much 
better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31416) Check more strictly that a field name can be used as a valid Java identifier for codegen

2020-04-10 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-31416:
--

 Summary: Check more strictly that a field name can be used as a 
valid Java identifier for codegen
 Key: SPARK-31416
 URL: https://issues.apache.org/jira/browse/SPARK-31416
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


ScalaReflection checks that a field name can be used as a valid Java identifier 
for codegen by checking that the field name is not a reserved keyword.

However, in the current implementation, enum is missing from the keyword list.
Furthermore, some characters, including numeric literals at the start of a 
name, cannot appear in a valid identifier, but they are not checked either.
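
As a sketch of what a stricter check could look like (this is not the actual 
ScalaReflection code), the JDK itself can decide whether a name is a 
syntactically valid, non-keyword Java identifier:

{code}
import javax.lang.model.SourceVersion

// rejects reserved words (including enum) as well as names that are not
// syntactically valid Java identifiers (e.g. names starting with a digit)
def isValidJavaIdentifier(name: String): Boolean =
  SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name)

isValidJavaIdentifier("enum")  // false: reserved word
isValidJavaIdentifier("1abc")  // false: starts with a numeric literal
isValidJavaIdentifier("value") // true
{code}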



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31404) backward compatibility issues after switching to Proleptic Gregorian calendar

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31404:
--
Priority: Blocker  (was: Major)

> backward compatibility issues after switching to Proleptic Gregorian calendar
> -
>
> Key: SPARK-31404
> URL: https://issues.apache.org/jira/browse/SPARK-31404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> In Spark 3.0, we switched to the Proleptic Gregorian calendar by using the 
> Java 8 datetime APIs. This makes Spark follow the ISO and SQL standards, but 
> it introduces some backward compatibility problems:
> 1. Spark may read wrong data from data files written by Spark 2.4
> 2. there may be a performance regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31404) backward compatibility issues after switching to Proleptic Gregorian calendar

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31404:
--
Target Version/s: 3.0.0

> backward compatibility issues after switching to Proleptic Gregorian calendar
> -
>
> Key: SPARK-31404
> URL: https://issues.apache.org/jira/browse/SPARK-31404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> In Spark 3.0, we switched to the Proleptic Gregorian calendar by using the 
> Java 8 datetime APIs. This makes Spark follow the ISO and SQL standards, but 
> it introduces some backward compatibility problems:
> 1. Spark may read wrong data from data files written by Spark 2.4
> 2. there may be a performance regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31415) builtin date-time functions/operations improvement

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31415:
-

Assignee: Maxim Gekk

> builtin date-time functions/operations improvement
> --
>
> Key: SPARK-31415
> URL: https://issues.apache.org/jira/browse/SPARK-31415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31306:


Assignee: Bryan Cutler

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Assignee: Bryan Cutler
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e., for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31306:


Assignee: (was: Bryan Cutler)

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e., for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-31306.
--
Resolution: Fixed

Issue resolved by pull request 28071
https://github.com/apache/spark/pull/28071

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e., for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31399) closure cleaner is broken in Scala 2.12

2020-04-10 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080627#comment-17080627
 ] 

Dongjoon Hyun commented on SPARK-31399:
---

According to the above information,
- I additionally added the failure result from Apache Spark 2.4.5 with Scala 
2.12 to the JIRA description.
- Updated the title from `in Spark 3.0` to `in Scala 2.12`.
- Added `2.4.5` as `Affected Version` (but not as a target version).

> closure cleaner is broken in Scala 2.12
> ---
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> The `ClosureCleaner` only supports Scala functions, and it uses the following 
> check to catch closures:
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}
> **Apache Spark 2.4.5 with Scala 2.12**
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureS

[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31399:
--
Affects Version/s: (was: 2.4.6)
   2.4.5

> closure cleaner is broken in Scala 2.12
> ---
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> The `ClosureCleaner` only supports Scala functions, and it uses the following 
> check to catch closures:
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}
> **Apache Spark 2.4.5 with Scala 2.12**
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2

[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31399:
--
Affects Version/s: 2.4.6

> closure cleaner is broken in Scala 2.12
> ---
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.6
>Reporter: Wenchen Fan
>Priority: Blocker
>
> The `ClosureCleaner` only supports Scala functions, and it uses the following 
> check to catch closures:
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}
> **Apache Spark 2.4.5 with Scala 2.12**
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
>   at org.apache.spark.rdd.RDD.$an

[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31399:
--
Description: 
The `ClosureCleaner` only supports Scala functions, and it uses the following 
check to catch closures:
{code}
  // Check whether a class represents a Scala closure
  private def isClosure(cls: Class[_]): Boolean = {
cls.getName.contains("$anonfun$")
  }
{code}

This no longer works in 3.0, as we upgraded to Scala 2.12 and most Scala 
functions became Java lambdas.
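
A small sketch of why the name-based check no longer matches (the generated 
class names are only indicative):

{code}
// Under Scala 2.12, a function literal compiles to a Java lambda whose runtime
// class name contains "$$Lambda$<n>" (compare the $Lambda$2438/170049100 entry
// in the serialization stack below) instead of "$anonfun$", so isClosure
// returns false and the closure is never cleaned.
val f: Int => Int = _ + 1
val name = f.getClass.getName // e.g. "...$$Lambda$1234/123456789"
name.contains("$anonfun$")    // false
{code}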

As an example, the following code works well in Spark 2.4 Spark Shell:
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.functions.lit

case class Foo(id: String)
val col = lit("123")
val df = sc.range(0,10,1,1).map { _ => Foo("") }

// Exiting paste mode, now interpreting.

import org.apache.spark.sql.functions.lit
defined class Foo
col: org.apache.spark.sql.Column = 123
df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
{code}

But fails in 3.0
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.functions.lit

case class Foo(id: String)
val col = lit("123")
val df = sc.range(0,10,1,1).map { _ => Foo("") }

// Exiting paste mode, now interpreting.

org.apache.spark.SparkException: Task not serializable
  at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
  at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.map(RDD.scala:421)
  ... 39 elided
Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: 
123)
- field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
- object (class $iw, $iw@2d87ac2b)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, 
SerializedLambda[capturingClass=class $iw, 
functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
 implementation=invokeStatic 
$anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
  at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
  at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
  at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
  at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
  ... 47 more
{code}

**Apache Spark 2.4.5 with Scala 2.12**
{code}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
  /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :pa
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.functions.lit

case class Foo(id: String)
val col = lit("123")
val df = sc.range(0,10,1,1).map { _ => Foo("") }

// Exiting paste mode, now interpreting.

org.apache.spark.SparkException: Task not serializable
  at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
  at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:393)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
  at org.apache.spark.rdd.RDD.map(RDD.scala:392)
  ... 45 elided
Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: 
123)
- field (class: $iw, name: col, type: class org.apache.spark.

[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12

2020-04-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31399:
--
Summary: closure cleaner is broken in Scala 2.12  (was: closure cleaner is 
broken in Spark 3.0)

> closure cleaner is broken in Scala 2.12
> ---
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> The `ClosureCleaner` only supports Scala functions, and it uses the following 
> check to catch closures:
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite

2020-04-10 Thread Srinivas Rishindra Pothireddi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-31377:
--
Description: 
For some combinations of join algorithm and join type there are no unit tests 
for the "number of output rows" metric.

A list of the missing unit tests includes the following.
 * SortMergeJoin: ExistenceJoin
 * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin
 * BroadcastNestedLoopJoin: RightOuter, InnerJoin, ExistenceJoin
 * BroadcastHashJoin: LeftAnti, ExistenceJoin

  was:
For some combinations of join algorithm and join type there are no unit tests 
for the "number of output rows" metric.

A list of the missing unit tests includes the following.
 * SortMergeJoin: ExistenceJoin
 * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin
 * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin
 * BroadcastHashJoin: LeftAnti, ExistenceJoin


> Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
> --
>
> Key: SPARK-31377
> URL: https://issues.apache.org/jira/browse/SPARK-31377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> For some combinations of join algorithm and join type there are no unit 
> tests for the "number of output rows" metric.
> A list of the missing unit tests includes the following.
>  * SortMergeJoin: ExistenceJoin
>  * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin
>  * BroadcastNestedLoopJoin: RightOuter, InnerJoin, ExistenceJoin
>  * BroadcastHashJoin: LeftAnti, ExistenceJoin
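
As a rough sketch of the shape such a test could take (a plain inner 
sort-merge join is used purely for illustration, assuming a SparkSession 
{{spark}}; the missing combinations listed above would follow the same 
pattern, and this is not the existing SQLMetricsSuite helper style):

{code}
import org.apache.spark.sql.execution.joins.SortMergeJoinExec

// force a sort-merge join and materialise the result so the metric is populated
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
val joined = spark.range(0, 100).toDF("k").join(spark.range(0, 50).toDF("k"), "k")
joined.collect()

// sum the "number of output rows" metric over all sort-merge join nodes
val numOutputRows = joined.queryExecution.executedPlan.collect {
  case j: SortMergeJoinExec => j.metrics("numOutputRows").value
}.sum
assert(numOutputRows == 50)
{code}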



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite

2020-04-10 Thread Srinivas Rishindra Pothireddi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-31377:
--
Description: 
For some combinations of join algorithm and join type there are no unit tests 
for the "number of output rows" metric.

A list of the missing unit tests includes the following.
 * SortMergeJoin: ExistenceJoin
 * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin
 * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin
 * BroadcastHashJoin: LeftAnti, ExistenceJoin

  was:
For some combinations of join algorithm and join type there are no unit tests 
for the "number of output rows" metric.

A list of the missing unit tests includes the following.
 * SortMergeJoin: ExistenceJoin
 * ShuffledHashJoin: OuterJoin, LeftOuter, RightOuter, LeftAnti, LeftSemi, 
ExistenceJoin
 * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin
 * BroadcastHashJoin: LeftAnti, ExistenceJoin


> Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
> --
>
> Key: SPARK-31377
> URL: https://issues.apache.org/jira/browse/SPARK-31377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> For some combinations of join algorithm and join type there are no unit 
> tests for the "number of output rows" metric.
> A list of the missing unit tests includes the following.
>  * SortMergeJoin: ExistenceJoin
>  * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin
>  * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin
>  * BroadcastHashJoin: LeftAnti, ExistenceJoin



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands

2020-04-10 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-30953:
-
Summary: InsertAdaptiveSparkPlan should apply AQE on child plan of write 
commands  (was: InsertAdaptiveSparkPlan should also skip v2 command)

> InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
> 
>
> Key: SPARK-30953
> URL: https://issues.apache.org/jira/browse/SPARK-30953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> InsertAdaptiveSparkPlan should also skip v2 command as we did for v1 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands

2020-04-10 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-30953:
-
Description: Applying AQE to write commands together with their child plan 
would expose {{LogicalQueryStage}} to the {{Analyzer}}, while it should stay 
hidden under {{AdaptiveSparkPlanExec}} only, to avoid unexpected breakage.  
(was: InsertAdaptiveSparkPlan should also skip v2 command as we did for v1 
command.)

> InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
> 
>
> Key: SPARK-30953
> URL: https://issues.apache.org/jira/browse/SPARK-30953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Applying AQE to write commands together with their child plan would expose 
> {{LogicalQueryStage}} to the {{Analyzer}}, while it should stay hidden under 
> {{AdaptiveSparkPlanExec}} only, to avoid unexpected breakage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31415) builtin date-time functions/operations improvement

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31415.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> builtin date-time functions/operations improvement
> --
>
> Key: SPARK-31415
> URL: https://issues.apache.org/jira/browse/SPARK-31415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27898:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Support 4 date operators(date + integer, integer + date, date - integer and 
> date - date)
> 
>
> Key: SPARK-27898
> URL: https://issues.apache.org/jira/browse/SPARK-27898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Support 4 date operators(date + integer, integer + date, date - integer and 
> date - date):
> ||Operator||Example||Result||
> |+|date '2001-09-28' + integer '7'|date '2001-10-05'|
> |-|date '2001-10-01' - integer '7'|date '2001-09-24'|
> |-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)|
> [https://www.postgresql.org/docs/12/functions-datetime.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29387) Support `*` and `/` operators for intervals

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29387:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Support `*` and `/` operators for intervals
> ---
>
> Key: SPARK-29387
> URL: https://issues.apache.org/jira/browse/SPARK-29387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multiplying an interval by a numeric (`*`) and dividing it by a numeric (`/`). See 
> [https://www.postgresql.org/docs/12/functions-datetime.html]
> ||Operator||Example||Result||
> |*|900 * interval '1 second'|interval '00:15:00'|
> |*|21 * interval '1 day'|interval '21 days'|
> |/|interval '1 hour' / double precision '1.5'|interval '00:40:00'|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29365) Support date and timestamp subtraction

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29365:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Support date and timestamp subtraction 
> ---
>
> Key: SPARK-29365
> URL: https://issues.apache.org/jira/browse/SPARK-29365
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> PostgreSQL supports subtraction of dates and timestamps:
> {code}
> maxim=# select date'2020-01-01' - timestamp'2019-10-06 10:11:12.345678';
> ?column? 
> -
>  86 days 13:48:47.654322
> (1 row)
> maxim=# select timestamp'2019-10-06 10:11:12.345678' - date'2020-01-01';
>  ?column?  
> ---
>  -86 days -13:48:47.654322
> (1 row)
> {code}
> Need to provide the same feature in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31415) builtin date-time functions/operations improvement

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31415:

Summary: builtin date-time functions/operations improvement  (was: builtin 
date-time functions improvement)

> builtin date-time functions/operations improvement
> --
>
> Key: SPARK-31415
> URL: https://issues.apache.org/jira/browse/SPARK-31415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29774) Date and Timestamp type +/- null should be null as Postgres

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29774:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Date and Timestamp type +/- null should be null as Postgres
> ---
>
> Key: SPARK-29774
> URL: https://issues.apache.org/jira/browse/SPARK-29774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code:sql}
> postgres=# select timestamp '1999-12-31' - null;
>  ?column?
> --
> (1 row)
> postgres=# select date '1999-12-31' - null;
>  ?column?
> --
> (1 row)
> {code}
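
The equivalent checks in spark-sql, as a sketch of the expected post-fix behavior (both should return NULL rather than fail):
{code:sql}
SELECT timestamp '1999-12-31' - null;  -- expected: NULL
SELECT date '1999-12-31' - null;       -- expected: NULL
{code}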



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29486:

Parent: SPARK-31415
Issue Type: Sub-task  (was: Bug)

> CalendarInterval should have 3 fields: months, days and microseconds
> 
>
> Key: SPARK-29486
> URL: https://issues.apache.org/jira/browse/SPARK-29486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Liu, Linhong
>Assignee: Liu, Linhong
>Priority: Major
> Fix For: 3.0.0
>
>
> The current CalendarInterval has 2 fields: months and microseconds. This PR tries 
> to change it to 3 fields: months, days and microseconds. This is because one 
> logical day of interval may have a different number of microseconds (daylight 
> saving).
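
A short sketch of why the separate days field matters, assuming a session time zone with a DST transition (America/Los_Angeles gains an hour on 2019-11-03); with the 3-field interval, '1 day' and '24 hours' are no longer interchangeable:
{code:sql}
SET spark.sql.session.timeZone=America/Los_Angeles;
SELECT timestamp '2019-11-03 00:00:00' + interval 1 day;     -- expected: 2019-11-04 00:00:00
SELECT timestamp '2019-11-03 00:00:00' + interval 24 hours;  -- expected: 2019-11-03 23:00:00
{code}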



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29364) Return intervals from date subtracts

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29364:

Parent: SPARK-31415
Issue Type: Sub-task  (was: Improvement)

> Return intervals from date subtracts
> 
>
> Key: SPARK-29364
> URL: https://issues.apache.org/jira/browse/SPARK-29364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> According to the SQL standard, date1 - date2 is an interval. See 
> http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt at +_4.5.3  
> Operations involving datetimes and intervals_+. The ticket aims to modify the 
> DateDiff expression to produce another expression of the INTERVAL type when 
> `spark.sql.ansi.enabled` is set to *true* and `spark.sql.dialect` is *Spark*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28420) Date/Time Functions: date_part for intervals

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28420:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Date/Time Functions: date_part for intervals
> 
>
> Key: SPARK-28420
> URL: https://issues.apache.org/jira/browse/SPARK-28420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Function||Return Type||Description||Example||Result||
> |{{date_part(}}{{text}}{{, }}{{interval}}{{)}}|{{double precision}}|Get 
> subfield (equivalent to {{extract}}); see [Section 
> 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('month',
>  interval '2 years 3 months')}}|{{3}}|
> We can replace it with {{extract(field from timestamp)}}.
> https://www.postgresql.org/docs/11/functions-datetime.html
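
A minimal spark-sql sketch of the function described above, assuming the PostgreSQL-compatible behavior this ticket targets:
{code:sql}
SELECT date_part('month', interval '2 years 3 months');  -- expected: 3
{code}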



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29355) Support timestamps subtraction

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29355:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Support timestamps subtraction
> --
>
> Key: SPARK-29355
> URL: https://issues.apache.org/jira/browse/SPARK-29355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Operator||Example||Result||
> |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
> 12:00'}}|{{interval '1 day 15:00:00'}}|
> https://www.postgresql.org/docs/11/functions-datetime.html
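
The same example as a spark-sql sketch, assuming the Spark 3.0 behavior (the exact interval rendering may differ):
{code:sql}
SELECT timestamp '2001-09-29 03:00:00' - timestamp '2001-09-27 12:00:00';
-- expected: an interval of 1 day 15 hours
{code}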



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29187) Return null from `date_part()` for the null `field`

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29187:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Return null from `date_part()` for the null `field`
> ---
>
> Key: SPARK-29187
> URL: https://issues.apache.org/jira/browse/SPARK-29187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> PostgreSQL return NULL for the NULL field from the date_part() function:
> {code}
> maxim=# select date_part(null, date'2019-09-20');
>  date_part 
> ---
>   
> (1 row)
> {code}
> but Spark fails with the error:
> {code}
> spark-sql> select date_part(null, date'2019-09-20');
> Error in query: null; line 1 pos 7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28141) Date type can not accept special values

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28141:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Date type can not accept special values
> ---
>
> Key: SPARK-28141
> URL: https://issues.apache.org/jira/browse/SPARK-28141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Input String||Valid Types||Description||
> |{{epoch}}|{{date}}|1970-01-01 00:00:00+00 (Unix system time zero)|
> |{{now}}|{{date}}|current transaction's start time|
> |{{today}}|{{date}}|midnight today|
> |{{tomorrow}}|{{date}}|midnight tomorrow|
> |{{yesterday}}|{{date}}|midnight yesterday|
> https://www.postgresql.org/docs/12/datatype-datetime.html
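
A hedged sketch of the special values in spark-sql, assuming Spark 3.0 accepts the strings listed above in date literals:
{code:sql}
SELECT date 'epoch';                                     -- expected: 1970-01-01
SELECT date 'yesterday', date 'today', date 'tomorrow';  -- relative to the current date
{code}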



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29012) Timestamp type can not accept special values

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29012:

Parent Issue: SPARK-31415  (was: SPARK-27764)

> Timestamp type can not accept special values
> 
>
> Key: SPARK-29012
> URL: https://issues.apache.org/jira/browse/SPARK-29012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Input String||Valid Types||Description||
> |{{epoch}}|{{timestamp}}|1970-01-01 00:00:00+00 (Unix system time zero)|
> |{{now}}|{{timestamp}}|current transaction's start time|
> |{{today}}|{{timestamp}}|midnight today|
> |{{tomorrow}}|{{timestamp}}|midnight tomorrow|
> |{{yesterday}}|{{timestamp}}|midnight yesterday|
> https://www.postgresql.org/docs/12/datatype-datetime.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28690) Date/Time Functions: date_part for timestamps

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28690:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Date/Time Functions: date_part for timestamps
> -
>
> Key: SPARK-28690
> URL: https://issues.apache.org/jira/browse/SPARK-28690
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Function||Return Type||Description||Example||Result||
> |{{date_part(}}{{text}}{{, }}{{timestamp}}{{)}}|{{double precision}}|Get 
> subfield (equivalent to {{extract}}); see [Section 
> 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('hour',
>  timestamp '2001-02-16 20:38:40')}}|{{20}}|
> We can replace it with {{extract(field from timestamp)}}.
> https://www.postgresql.org/docs/11/functions-datetime.html
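
The table's example as a spark-sql sketch, assuming the Spark 3.0 function added by this ticket:
{code:sql}
SELECT date_part('hour', timestamp '2001-02-16 20:38:40');  -- expected: 20
{code}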



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28687) Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28687:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`
> 
>
> Key: SPARK-28687
> URL: https://issues.apache.org/jira/browse/SPARK-28687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, we support these fields for EXTRACT: CENTURY, MILLENNIUM, DECADE, 
> YEAR, QUARTER, MONTH, WEEK, DAY, DAYOFWEEK, HOUR, MINUTE, SECOND, DOW, 
> ISODOW, DOY.
> We also need to support: EPOCH, MICROSECONDS, MILLISECONDS, TIMEZONE, 
> TIMEZONE_M, TIMEZONE_H, ISOYEAR.
> https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28700) make_timestamp loses seconds fractions

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28700:

Parent: SPARK-31415
Issue Type: Sub-task  (was: Bug)

> make_timestamp loses seconds fractions
> --
>
> Key: SPARK-28700
> URL: https://issues.apache.org/jira/browse/SPARK-28700
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The make_timestamp() function accepts seconds with a fractional part of up to 
> microsecond precision. In some cases, it can lose precision in the fractional part. 
> For example:
> {code}
> spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.01);
> 2019-08-12 00:00:58
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28656) Support `millennium`, `century` and `decade` at `extract()`

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28656:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Support `millennium`, `century` and `decade` at `extract()`
> ---
>
> Key: SPARK-28656
> URL: https://issues.apache.org/jira/browse/SPARK-28656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, we support these fields for EXTRACT: YEAR, QUARTER, MONTH, WEEK, 
> DAY, DAYOFWEEK, HOUR, MINUTE, SECOND.
> We also need to support: EPOCH, CENTURY, MILLENNIUM, DECADE, MICROSECONDS, 
> MILLISECONDS, DOW, ISODOW, DOY, TIMEZONE, TIMEZONE_M, TIMEZONE_H, JULIAN, 
> ISOYEAR.
> https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28017) Enhance DATE_TRUNC

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28017:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Enhance DATE_TRUNC
> --
>
> Key: SPARK-28017
> URL: https://issues.apache.org/jira/browse/SPARK-28017
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> For DATE_TRUNC, we need to support: microseconds, milliseconds, decade, century, 
> millennium.
> https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
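
A minimal spark-sql sketch of the finer-grained truncation, assuming the Spark 3.0 behavior (support for the decade/century/millennium field names may vary by version):
{code:sql}
SELECT date_trunc('MILLISECOND', timestamp '2015-03-05 09:32:05.123456');
-- expected: 2015-03-05 09:32:05.123
{code}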



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28459) Date/Time Functions: make_timestamp

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28459:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Date/Time Functions: make_timestamp
> ---
>
> Key: SPARK-28459
> URL: https://issues.apache.org/jira/browse/SPARK-28459
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Function||Return Type||Description||Example||Result||
> |{{make_timestamp(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ }}{{int}}{{, 
> _hour_ }}{{int}}{{, _min_ }}{{int}}{{, _sec_}}{{double 
> precision}}{{)}}|{{timestamp}}|Create timestamp from year, month, day, hour, 
> minute and seconds fields|{{make_timestamp(2013, 7, 15, 8, 15, 
> 23.5)}}|{{2013-07-15 08:15:23.5}}|
> https://www.postgresql.org/docs/11/functions-datetime.html
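
The same example as a spark-sql sketch, assuming the Spark 3.0 function:
{code:sql}
SELECT make_timestamp(2013, 7, 15, 8, 15, 23.5);  -- expected: 2013-07-15 08:15:23.5
{code}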



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28432) Add `make_date` function

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28432:

Parent Issue: SPARK-31415  (was: SPARK-30375)

> Add `make_date` function
> 
>
> Key: SPARK-28432
> URL: https://issues.apache.org/jira/browse/SPARK-28432
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Function||Return Type||Description||Example||Result||
> |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ 
> }}{{int}}{{)}}|{{date}}|Create date from year, month and day 
> fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}|
> https://www.postgresql.org/docs/11/functions-datetime.html
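
The same example as a spark-sql sketch, assuming the Spark 3.0 function:
{code:sql}
SELECT make_date(2013, 7, 15);  -- expected: 2013-07-15
{code}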



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31415) builtin date-time functions improvement

2020-04-10 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31415:
---

 Summary: builtin date-time functions improvement
 Key: SPARK-31415
 URL: https://issues.apache.org/jira/browse/SPARK-31415
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31121) catalog plugin API

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31121:

Summary: catalog plugin API  (was: Spark API for Table Metadata)

> catalog plugin API
> --
>
> Key: SPARK-31121
> URL: https://issues.apache.org/jira/browse/SPARK-31121
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> Details please see the SPIP doc: 
> https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d
> This will bring multi-catalog support to Spark and allow external catalog 
> implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27576) table capabilty to skip the output column resolution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27576:

Parent: SPARK-25390
Issue Type: Sub-task  (was: Improvement)

> table capabilty to skip the output column resolution
> 
>
> Key: SPARK-27576
> URL: https://issues.apache.org/jira/browse/SPARK-27576
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27521) move data source v2 API to catalyst module

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27521:

Parent: SPARK-25390
Issue Type: Sub-task  (was: Improvement)

> move data source v2 API to catalyst module
> --
>
> Key: SPARK-27521
> URL: https://issues.apache.org/jira/browse/SPARK-27521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26811:

Parent: SPARK-25390
Issue Type: Sub-task  (was: Bug)

> Add DataSourceV2 capabilities to check support for batch append, overwrite, 
> truncate during analysis.
> -
>
> Key: SPARK-26811
> URL: https://issues.apache.org/jira/browse/SPARK-26811
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27190) Add DataSourceV2 capabilities for streaming

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27190:

Parent: SPARK-25390
Issue Type: Sub-task  (was: Improvement)

> Add DataSourceV2 capabilities for streaming
> ---
>
> Key: SPARK-27190
> URL: https://issues.apache.org/jira/browse/SPARK-27190
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24882) data source v2 API improvement

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24882:

Parent Issue: SPARK-25390  (was: SPARK-22386)

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better names 
> for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31406.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28180
[https://github.com/apache/spark/pull/28180]

> ThriftServerQueryTestSuite: Sharing test data and test tables among multiple 
> test cases.
> 
>
> Key: SPARK-31406
> URL: https://issues.apache.org/jira/browse/SPARK-31406
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Minor
> Fix For: 3.0.0
>
>
> ThriftServerQueryTestSuite spends 17 minutes to run.
> I checked the code and found that ThriftServerQueryTestSuite loads test data 
> repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31406:
---

Assignee: jiaan.geng

> ThriftServerQueryTestSuite: Sharing test data and test tables among multiple 
> test cases.
> 
>
> Key: SPARK-31406
> URL: https://issues.apache.org/jira/browse/SPARK-31406
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Minor
>
> ThriftServerQueryTestSuite spends 17 minutes to run.
> I checked the code and found that ThriftServerQueryTestSuite loads test data 
> repeatedly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31124) change the default value of minPartitionNum in AQE

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31124:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> change the default value of minPartitionNum in AQE
> --
>
> Key: SPARK-31124
> URL: https://issues.apache.org/jira/browse/SPARK-31124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31414) Performance regression with new TimestampFormatter for json and csv

2020-04-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-31414:


 Summary: Performance regression with new TimestampFormatter for 
json and csv
 Key: SPARK-31414
 URL: https://issues.apache.org/jira/browse/SPARK-31414
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


With the original benchmark, where the timestamp values are valid for the new 
parser,

the result is:

{code:java}
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5781 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 44764 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 93764 ms
[info]   Running case: from_json(timestamp)
[info]   Stopped after 3 iterations, 59021 ms
{code}

When we modify the benchmark to:

{code:java}
  def timestampStr: Dataset[String] = {
spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
  iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
}.select($"value".as("timestamp")).as[String]
  }

  readBench.addCase("timestamp strings", numIters) { _ =>
timestampStr.noop()
  }

  readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ 
=>
spark.read.schema(tsSchema).json(timestampStr).noop()
  }

  readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ 
=>
spark.read.json(timestampStr).noop()
  }
{code}

where the timestamp values are invalid for the new parser, which causes a 
fallback to the legacy parser,
the result is:

{code:java}
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
{code}

This is roughly a 10x performance regression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-10 Thread varun senthilnathan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

varun senthilnathan updated SPARK-31413:

Description: 
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

Function printSeq = s -> s;
 KinesisUtils.createStream(
 jssc,
 appName,
 streamName,
 endPointUrl,
 regionName,
 InitialPositionInStream.TRIM_HORIZON,
 kinesisCheckpointInterval,
 StorageLevel.MEMORY_AND_DISK_SER(),
 printSeq,
 Record.class);

*{color:#172b4d}Record is of type 
com.amazonaws.services.kinesis.model.Record{color}*

 

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?

  was:
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

Function printSeq = s -> s;
 KinesisUtils.createStream(
 jssc,
 appName,
 streamName,
 endPointUrl,
 regionName,
 InitialPositionInStream.TRIM_HORIZON,
 kinesisCheckpointInterval,
 StorageLevel.MEMORY_AND_DISK_SER(),
 printSeq,
 Record.class);

Record is of type com.amazonaws.services.kinesis.model.Record

 

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?


> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using spark 2.4.3 in java. We would like to log the partition key and 
> the sequence number of every event. The overloaded create stream function of 
> the kinesis utils always throws a compilation error.
>  
> Function printSeq = s -> s;
>  KinesisUtils.createStream(
>  jssc,
>  appName,
>  streamName,
>  endPointUrl,
>  regionName,
>  InitialPositionInStream.TRIM_HORIZON,
>  kinesisCheckpointInterval,
>  StorageLevel.MEMORY_AND_DISK_SER(),
>  printSeq,
>  Record.class);
> *{color:#172b4d}Record is of type 
> com.amazonaws.services.kinesis.model.Record{color}*
>  
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS : 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apach

[jira] [Resolved] (SPARK-31412) New Adaptive Query Execution in Spark SQL

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31412.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> New Adaptive Query Execution in Spark SQL
> -
>
> Key: SPARK-31412
> URL: https://issues.apache.org/jira/browse/SPARK-31412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
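
A minimal sketch of how the new AQE framework and its main features are switched on, assuming the Spark 3.0 configuration names (defaults may differ between builds):
{code:sql}
SET spark.sql.adaptive.enabled=true;                     -- enable adaptive query execution
SET spark.sql.adaptive.coalescePartitions.enabled=true;  -- coalesce post-shuffle partitions at runtime
SET spark.sql.adaptive.skewJoin.enabled=true;            -- split skewed join partitions at runtime
{code}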



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-10 Thread varun senthilnathan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

varun senthilnathan updated SPARK-31413:

Description: 
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

Function printSeq = s -> s;
 KinesisUtils.createStream(
 jssc,
 appName,
 streamName,
 endPointUrl,
 regionName,
 InitialPositionInStream.TRIM_HORIZON,
 kinesisCheckpointInterval,
 StorageLevel.MEMORY_AND_DISK_SER(),
 printSeq,
 Record.class);

Record is of type com.amazonaws.services.kinesis.model.Record

 

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?

  was:
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

Function printSeq = s -> s;
 KinesisUtils.createStream(
 jssc,
 appName,
 streamName,
 endPointUrl,
 regionName,
 InitialPositionInStream.TRIM_HORIZON,
 kinesisCheckpointInterval,
 StorageLevel.MEMORY_AND_DISK_SER(),
 printSeq,
 Record.class);

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?


> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using spark 2.4.3 in java. We would like to log the partition key and 
> the sequence number of every event. The overloaded create stream function of 
> the kinesis utils always throws a compilation error.
>  
> Function printSeq = s -> s;
>  KinesisUtils.createStream(
>  jssc,
>  appName,
>  streamName,
>  endPointUrl,
>  regionName,
>  InitialPositionInStream.TRIM_HORIZON,
>  kinesisCheckpointInterval,
>  StorageLevel.MEMORY_AND_DISK_SER(),
>  printSeq,
>  Record.class);
> Record is of type com.amazonaws.services.kinesis.model.Record
>  
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS : 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (SPARK-30864) Add the user guide for Adaptive Query Execution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30864:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Add the user guide for Adaptive Query Execution
> ---
>
> Key: SPARK-30864
> URL: https://issues.apache.org/jira/browse/SPARK-30864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> This Jira will add a detailed user guide describing how to enable AQE and 
> its three main features.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30918) improve the splitting of skewed partitions

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30918:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> improve the splitting of skewed partitions
> --
>
> Key: SPARK-30918
> URL: https://issues.apache.org/jira/browse/SPARK-30918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28560:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Optimize shuffle reader to local shuffle reader when smj converted to bhj in 
> adaptive execution
> ---
>
> Key: SPARK-28560
> URL: https://issues.apache.org/jira/browse/SPARK-28560
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule is used to optimize the shuffle reader to local 
> shuffle reader when smj is converted to bhj in adaptive execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29893:

Parent: (was: SPARK-28560)
Issue Type: Improvement  (was: Sub-task)

> Improve the local reader performance by changing the task number from 1 to 
> multi
> 
>
> Key: SPARK-29893
> URL: https://issues.apache.org/jira/browse/SPARK-29893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the local reader reads all the partitions of the map stage using only 
> 1 task, which may cause performance degradation. This PR will improve the 
> performance by using multiple tasks instead of one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-10 Thread varun senthilnathan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

varun senthilnathan updated SPARK-31413:

Environment: (was: We are using spark 2.4.3 in java. We would like to 
log the partition key and the sequence number of every event. The overloaded 
create stream function of the kinesis utils always throws a compilation error.

 

{{Function printSeq = s -> s;}}

{{KinesisUtils.createStream(}}jssc, appName, streamName, endPointUrl, 
regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, 
StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS: 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?)

> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using spark 2.4.3 in java. We would like to log the partition key and 
> the sequence number of every event. The overloaded create stream function of 
> the kinesis utils always throws a compilation error.
>  
> Function printSeq = s -> s;
>  KinesisUtils.createStream(
>  jssc,
>  appName,
>  streamName,
>  endPointUrl,
>  regionName,
>  InitialPositionInStream.TRIM_HORIZON,
>  kinesisCheckpointInterval,
>  StorageLevel.MEMORY_AND_DISK_SER(),
>  printSeq,
>  Record.class);
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS : 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-10 Thread varun senthilnathan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

varun senthilnathan updated SPARK-31413:

Description: 
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

Function printSeq = s -> s;
 KinesisUtils.createStream(
 jssc,
 appName,
 streamName,
 endPointUrl,
 regionName,
 InitialPositionInStream.TRIM_HORIZON,
 kinesisCheckpointInterval,
 StorageLevel.MEMORY_AND_DISK_SER(),
 printSeq,
 Record.class);

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?

  was:
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

{{Function printSeq = s -> s;
KinesisUtils.createStream(
  jssc,
  appName,
  streamName,
  endPointUrl,
  regionName,
  InitialPositionInStream.TRIM_HORIZON,
  kinesisCheckpointInterval,
  StorageLevel.MEMORY_AND_DISK_SER(),
  printSeq,
  Record.class);}}

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?


> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
> Environment: We are using spark 2.4.3 in java. We would like to log 
> the partition key and the sequence number of every event. The overloaded 
> create stream function of the kinesis utils always throws a compilation error.
>  
> {{Function printSeq = s -> s;}}
> {{KinesisUtils.createStream(}}jssc, appName, streamName, endPointUrl, 
> regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, 
> StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS: 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using spark 2.4.3 in java. We would like to log the partition key and 
> the sequence number of every event. The overloaded create stream function of 
> 

[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-10 Thread varun senthilnathan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

varun senthilnathan updated SPARK-31413:

Description: 
We are using spark 2.4.3 in java. We would like to log the partition key and 
the sequence number of every event. The overloaded create stream function of 
the kinesis utils always throws a compilation error.

 

{{Function printSeq = s -> s;
KinesisUtils.createStream(
  jssc,
  appName,
  streamName,
  endPointUrl,
  regionName,
  InitialPositionInStream.TRIM_HORIZON,
  kinesisCheckpointInterval,
  StorageLevel.MEMORY_AND_DISK_SER(),
  printSeq,
  Record.class);}}

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS : 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?

  was:We 


> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
> Environment: We are using Spark 2.4.3 in Java. We would like to log 
> the partition key and the sequence number of every event. The overloaded 
> createStream function of KinesisUtils always throws a compilation error.
>  
> {{Function printSeq = s -> s;}}
> {{KinesisUtils.createStream(jssc, appName, streamName, endPointUrl, 
> regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, 
> StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);}}
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS: 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using Spark 2.4.3 in Java. We would like to log the partition key and 
> the sequence number of every event. The overloaded createStream function of 
> KinesisUtils always throws a compilation error.
>  
> {{Function printSeq = s -> s;
> KinesisUtils.createStream(
>   jssc,
>   appName,
>   streamName,
>   endPointUrl,
>   regionName,
>   InitialPositionInStream.TRIM_HORIZON,
>   kinesisCheckpointInterval,
>   StorageLevel.MEMORY_AND_DISK_SER(),
>   printSeq,
>   Record.class);}}
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS: 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?
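
Judging from the argument list in the error message, printSeq is being passed as a 
java.util.function.Function, while the linked createStream overload expects an 
org.apache.spark.api.java.function.Function<Record, T> together with a Class<T> that 
matches the handler's return type. A minimal sketch under that assumption (the class 
and method names below are illustrative, not part of the reported code):

{code:java}
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.model.Record;

import org.apache.spark.api.java.function.Function; // Spark's Function, not java.util.function.Function
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kinesis.KinesisUtils;

public class KinesisSequenceSketch {

  // Message handler: keep the partition key and sequence number next to the payload.
  private static final Function<Record, String> printSeq = record ->
      record.getPartitionKey() + "," + record.getSequenceNumber() + ","
          + StandardCharsets.UTF_8.decode(record.getData()).toString();

  static JavaDStream<String> stream(JavaStreamingContext jssc, String appName,
      String streamName, String endPointUrl, String regionName,
      Duration kinesisCheckpointInterval) {
    return KinesisUtils.createStream(
        jssc, appName, streamName, endPointUrl, regionName,
        InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval,
        StorageLevel.MEMORY_AND_DISK_SER(),
        printSeq,       // org.apache.spark.api.java.function.Function<Record, String>
        String.class);  // element type produced by the handler, not Record.class
  }
}
{code}

Because the handler maps Record to String here, the final Class argument becomes 
String.class; it describes the element type of the resulting DStream rather than the 
Kinesis Record type.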



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Updated] (SPARK-28739) Add a simple cost check for Adaptive Query Execution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28739:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Add a simple cost check for Adaptive Query Execution
> 
>
> Key: SPARK-28739
> URL: https://issues.apache.org/jira/browse/SPARK-28739
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> Add a mechanism to compare the costs of the plans before and after 
> re-optimization in Adaptive Query Execution.
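
One plausible shape for such a check, sketched here with invented names and a deliberately 
crude cost (the number of shuffle exchanges), is to adopt the re-optimized plan only when 
its cost does not exceed the current plan's cost. This is an illustration of the idea, not 
necessarily the committed implementation:

{code:java}
import java.util.Arrays;
import java.util.List;

/** Illustrative-only plan node: a node name plus child nodes. */
class PlanNode {
  final String name;
  final List<PlanNode> children;

  PlanNode(String name, PlanNode... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }

  /** Count shuffle exchanges in this subtree; used as a crude plan cost. */
  int numShuffles() {
    int n = "ShuffleExchange".equals(name) ? 1 : 0;
    for (PlanNode child : children) {
      n += child.numShuffles();
    }
    return n;
  }
}

final class SimpleCostCheck {
  private SimpleCostCheck() {}

  /** Keep the re-optimized plan only if it is not more expensive than the current one. */
  static PlanNode choose(PlanNode current, PlanNode reOptimized) {
    return reOptimized.numShuffles() <= current.numShuffles() ? reOptimized : current;
  }
}
{code}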



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-10 Thread varun senthilnathan (Jira)
varun senthilnathan created SPARK-31413:
---

 Summary: Accessing the sequence number and partition id for 
records in Kinesis adapter
 Key: SPARK-31413
 URL: https://issues.apache.org/jira/browse/SPARK-31413
 Project: Spark
  Issue Type: Question
  Components: DStreams
Affects Versions: 2.4.3
 Environment: We are using Spark 2.4.3 in Java. We would like to log 
the partition key and the sequence number of every event. The overloaded 
createStream function of KinesisUtils always throws a compilation error.

 

{{Function printSeq = s -> s;}}

{{KinesisUtils.createStream(jssc, appName, streamName, endPointUrl, 
regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, 
StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);}}

The exception is as follows:
{quote}no suitable method found for 
createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
{quote}
JAVA DOCS: 
[https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?
Reporter: varun senthilnathan


We 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28753) Dynamically reuse subqueries in AQE

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28753:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Dynamically reuse subqueries in AQE
> ---
>
> Key: SPARK-28753
> URL: https://issues.apache.org/jira/browse/SPARK-28753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31037) refine AQE config names

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31037:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> refine AQE config names
> ---
>
> Key: SPARK-31037
> URL: https://issues.apache.org/jira/browse/SPARK-31037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31070) make skew join split skewed partitions more evenly

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31070:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> make skew join split skewed partitions more evenly
> --
>
> Key: SPARK-31070
> URL: https://issues.apache.org/jira/browse/SPARK-31070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31134) optimize skew join after shuffle partitions are coalesced

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31134:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> optimize skew join after shuffle partitions are coalesced
> -
>
> Key: SPARK-31134
> URL: https://issues.apache.org/jira/browse/SPARK-31134
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31253) add metrics to shuffle reader

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31253:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> add metrics to shuffle reader
> -
>
> Key: SPARK-31253
> URL: https://issues.apache.org/jira/browse/SPARK-31253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31201) add an individual config for skewed partition threshold

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31201:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> add an individual config for skewed partition threshold
> ---
>
> Key: SPARK-31201
> URL: https://issues.apache.org/jira/browse/SPARK-31201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31253) add metrics to AQE shuffle reader

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31253:

Summary: add metrics to AQE shuffle reader  (was: add metrics to shuffle 
reader)

> add metrics to AQE shuffle reader
> -
>
> Key: SPARK-31253
> URL: https://issues.apache.org/jira/browse/SPARK-31253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29544:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Optimize skewed join at runtime with new Adaptive Execution
> ---
>
> Key: SPARK-29544
> URL: https://issues.apache.org/jira/browse/SPARK-29544
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Skewed Join Optimization Design Doc.docx
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
> used to handle the skew join optimization based on the runtime statistics 
> (data size and row count).
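
As a rough illustration of the approach (the thresholds and names below are made up, not 
Spark's defaults), a reduce partition can be flagged as skewed when it is several times the 
median partition size and above an absolute floor; the skewed side is then split into 
mapper-index ranges that are read by separate tasks while the other side's matching 
partition is duplicated across them:

{code:java}
import java.util.ArrayList;
import java.util.List;

final class SkewedJoinSketch {
  private SkewedJoinSketch() {}

  // Illustrative thresholds only.
  static final double SKEW_FACTOR = 5.0;
  static final long SKEW_SIZE_FLOOR_BYTES = 256L * 1024 * 1024;

  /** A reduce partition counts as skewed if it dwarfs the median and is above a floor. */
  static boolean isSkewed(long partitionBytes, long medianPartitionBytes) {
    return partitionBytes > medianPartitionBytes * SKEW_FACTOR
        && partitionBytes > SKEW_SIZE_FLOOR_BYTES;
  }

  /**
   * Split one skewed reduce partition into groups of map outputs of roughly
   * targetBytes each; each group can then be read by its own task.
   * Returns [start, end) mapper index ranges.
   */
  static List<int[]> splitByMapperRanges(long[] bytesByMapper, long targetBytes) {
    List<int[]> ranges = new ArrayList<>();
    int start = 0;
    long acc = 0;
    for (int i = 0; i < bytesByMapper.length; i++) {
      acc += bytesByMapper[i];
      if (acc >= targetBytes) { // close the current group once it is large enough
        ranges.add(new int[] {start, i + 1});
        start = i + 1;
        acc = 0;
      }
    }
    if (start < bytesByMapper.length) {
      ranges.add(new int[] {start, bytesByMapper.length}); // remainder
    }
    return ranges;
  }
}
{code}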



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30524:

Parent Issue: SPARK-31412  (was: SPARK-29544)

> Disable OptimizeSkewJoin rule if introducing additional shuffle.
> 
>
> Key: SPARK-30524
> URL: https://issues.apache.org/jira/browse/SPARK-30524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> The OptimizeSkewedJoin rule breaks the outputPartitioning of the original SMJ 
> and may introduce an additional shuffle after it is applied. This PR disables 
> the "OptimizeSkewedJoin" rule if it would introduce an additional shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29954) collect the runtime statistics of row count in map stage

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29954:

Parent Issue: SPARK-31412  (was: SPARK-29544)

> collect the runtime statistics of row count in map stage
> 
>
> Key: SPARK-29954
> URL: https://issues.apache.org/jira/browse/SPARK-29954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> We need the row count info to more accurately estimate the data skew 
> situation when there is a lot of duplicated data. This PR will collect the row 
> count info in the map stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28177) Adjust post shuffle partition number in adaptive execution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28177:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Adjust post shuffle partition number in adaptive execution
> --
>
> Key: SPARK-28177
> URL: https://issues.apache.org/jira/browse/SPARK-28177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule is used to adjust the post shuffle partitions based on 
> the map output statistics.
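
Roughly, the rule sums each reduce partition's size across all map outputs and merges 
adjacent small partitions until a target size is reached. A hedged sketch of that merging 
step (the class name, method name, and caller-supplied target size are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;

final class CoalesceShufflePartitionsSketch {
  private CoalesceShufflePartitionsSketch() {}

  /**
   * Merge adjacent post-shuffle partitions until each coalesced partition holds
   * roughly targetBytes. partitionBytes[i] is the total size of reduce partition i
   * across all map outputs. Returns the start index of each coalesced partition.
   */
  static List<Integer> coalesce(long[] partitionBytes, long targetBytes) {
    List<Integer> startIndices = new ArrayList<>();
    long current = 0;
    for (int i = 0; i < partitionBytes.length; i++) {
      // Start a new coalesced partition at the first index, or when adding this
      // partition would push the current group past the target size.
      if (startIndices.isEmpty() || current + partitionBytes[i] > targetBytes) {
        startIndices.add(i);
        current = 0;
      }
      current += partitionBytes[i];
    }
    return startIndices;
  }
}
{code}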



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31412) New Adaptive Query Execution in Spark SQL

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31412:

Summary: New Adaptive Query Execution in Spark SQL  (was: Adaptive Query 
Execution in Spark SQL)

> New Adaptive Query Execution in Spark SQL
> -
>
> Key: SPARK-31412
> URL: https://issues.apache.org/jira/browse/SPARK-31412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> performance; we can see this from the EnsureRequirements rule when it adds 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For 
> example, for a join of 3 tables in a single stage, the same ExchangeCoordinator 
> should be used in all three Exchanges, but currently two separate 
> ExchangeCoordinators will be added. Thirdly, with the current framework it is 
> not easy to flexibly implement other adaptive execution features, such as 
> changing the execution plan or handling skewed joins at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31412) Adaptive Query Execution in Spark SQL

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31412:

Description: 
SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
DAGScheduler, a new API is added to support submitting a single map stage.  The 
current implementation of adaptive execution in Spark SQL supports changing the 
reducer number at runtime. An Exchange coordinator is used to determine the 
number of post-shuffle partitions for a stage that needs to fetch shuffle data 
from one or multiple stages. The current implementation adds 
ExchangeCoordinator while we are adding Exchanges. However, there are some 
limitations. First, it may cause additional shuffles that may decrease 
performance; we can see this from the EnsureRequirements rule when it adds 
ExchangeCoordinator. Secondly, it is not a good idea to add ExchangeCoordinators 
while we are adding Exchanges because we don’t have a global picture of all 
shuffle dependencies of a post-shuffle stage. For example, for a join of 3 
tables in a single stage, the same ExchangeCoordinator should be used in all 
three Exchanges, but currently two separate ExchangeCoordinators will be added. 
Thirdly, with the current framework it is not easy to flexibly implement other 
adaptive execution features, such as changing the execution plan or handling 
skewed joins at runtime.

We'd like to introduce a new way to do adaptive execution in Spark SQL and 
address the limitations. The idea is described at 
[https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]

> Adaptive Query Execution in Spark SQL
> -
>
> Key: SPARK-31412
> URL: https://issues.apache.org/jira/browse/SPARK-31412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> performance; we can see this from the EnsureRequirements rule when it adds 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For 
> example, for a join of 3 tables in a single stage, the same ExchangeCoordinator 
> should be used in all three Exchanges, but currently two separate 
> ExchangeCoordinators will be added. Thirdly, with the current framework it is 
> not easy to flexibly implement other adaptive execution features, such as 
> changing the execution plan or handling skewed joins at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23128) The basic framework for the new Adaptive Query Execution

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23128:

Summary: The basic framework for the new Adaptive Query Execution  (was: A 
new approach to do adaptive execution in Spark SQL)

> The basic framework for the new Adaptive Query Execution
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> performance; we can see this from the EnsureRequirements rule when it adds 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For 
> example, for a join of 3 tables in a single stage, the same ExchangeCoordinator 
> should be used in all three Exchanges, but currently two separate 
> ExchangeCoordinators will be added. Thirdly, with the current framework it is 
> not easy to flexibly implement other adaptive execution features, such as 
> changing the execution plan or handling skewed joins at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2020-04-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23128:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> performance; we can see this from the EnsureRequirements rule when it adds 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For 
> example, for a join of 3 tables in a single stage, the same ExchangeCoordinator 
> should be used in all three Exchanges, but currently two separate 
> ExchangeCoordinators will be added. Thirdly, with the current framework it is 
> not easy to flexibly implement other adaptive execution features, such as 
> changing the execution plan or handling skewed joins at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31412) Adaptive Query Execution in Spark SQL

2020-04-10 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31412:
---

 Summary: Adaptive Query Execution in Spark SQL
 Key: SPARK-31412
 URL: https://issues.apache.org/jira/browse/SPARK-31412
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


