[jira] [Assigned] (SPARK-34735) Add modified configs for SQL execution in UI

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34735:


Assignee: Apache Spark

> Add modified configs for SQL execution in UI
> 
>
> Key: SPARK-34735
> URL: https://issues.apache.org/jira/browse/SPARK-34735
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
> Attachments: sql-ui.jpg
>
>
> For SQL users, it's very common to set some configs to optimize SQL. Within a 
> script, it would look like this:
> {code:java}
> set k1=v1;
> set k2=v2;
> set ...
> INSERT INTO TABLE t1
> SELECT ...
> {code}
>  
>  It's hard to find the configs used by a SQL query without the raw SQL string. 
> The current UI provides an `Environment` tab, but it only shows some global 
> initial configs, which is not enough.
> Some use cases:
>  * Jar-based jobs, where we might set configs many times across many SQL executions.
>  * SQL servers (e.g. SparkThriftServer), where we might execute thousands of scripts 
> every day with different sessions.
> We expect a feature that can list the modified configs which could affect the 
> SQL execution.
>  
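To make the request concrete, here is a minimal sketch (not the proposed UI change itself) of what "modified configs" means for a session: snapshot the session configs around the SET statements and diff them. The config keys and values below are placeholders.

{code:scala}
// Minimal sketch, not the proposed feature itself: diff the session configs around the
// SET statements to see which ones a script actually changed. Keys/values are placeholders.
import org.apache.spark.sql.SparkSession

object ModifiedConfigsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("modified-configs").getOrCreate()

    val before = spark.conf.getAll                        // session configs before the script
    spark.sql("SET spark.sql.shuffle.partitions=400")     // set k1=v1;
    spark.sql("SET spark.sql.adaptive.enabled=true")      // set k2=v2;
    val after = spark.conf.getAll

    // Keys added or changed by the script -- what the UI would need to surface per execution.
    val modified = after.filter { case (k, v) => !before.get(k).contains(v) }
    modified.foreach { case (k, v) => println(s"$k = $v") }

    spark.stop()
  }
}
{code}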



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34735) Add modified configs for SQL execution in UI

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301073#comment-17301073
 ] 

Apache Spark commented on SPARK-34735:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31830

> Add modified configs for SQL execution in UI
> 
>
> Key: SPARK-34735
> URL: https://issues.apache.org/jira/browse/SPARK-34735
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
> Attachments: sql-ui.jpg
>
>
> For SQL users, it's very common to set some configs to optimize SQL. Within a 
> script, it would look like this:
> {code:java}
> set k1=v1;
> set k2=v2;
> set ...
> INSERT INTO TABLE t1
> SELECT ...
> {code}
>  
>  It's hard to find the configs used by a SQL query without the raw SQL string. 
> The current UI provides an `Environment` tab, but it only shows some global 
> initial configs, which is not enough.
> Some use cases:
>  * Jar-based jobs, where we might set configs many times across many SQL executions.
>  * SQL servers (e.g. SparkThriftServer), where we might execute thousands of scripts 
> every day with different sessions.
> We expect a feature that can list the modified configs which could affect the 
> SQL execution.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34735) Add modified configs for SQL execution in UI

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301072#comment-17301072
 ] 

Apache Spark commented on SPARK-34735:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31830

> Add modified configs for SQL execution in UI
> 
>
> Key: SPARK-34735
> URL: https://issues.apache.org/jira/browse/SPARK-34735
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
> Attachments: sql-ui.jpg
>
>
> For SQL users, it's very common to set some configs to optimize SQL. Within a 
> script, it would look like this:
> {code:java}
> set k1=v1;
> set k2=v2;
> set ...
> INSERT INTO TABLE t1
> SELECT ...
> {code}
>  
>  It's hard to find the configs used by a SQL query without the raw SQL string. 
> The current UI provides an `Environment` tab, but it only shows some global 
> initial configs, which is not enough.
> Some use cases:
>  * Jar-based jobs, where we might set configs many times across many SQL executions.
>  * SQL servers (e.g. SparkThriftServer), where we might execute thousands of scripts 
> every day with different sessions.
> We expect a feature that can list the modified configs which could affect the 
> SQL execution.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34735) Add modified configs for SQL execution in UI

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34735:


Assignee: (was: Apache Spark)

> Add modified configs for SQL execution in UI
> 
>
> Key: SPARK-34735
> URL: https://issues.apache.org/jira/browse/SPARK-34735
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
> Attachments: sql-ui.jpg
>
>
> For SQL users, it's very common to set some configs to optimize SQL. Within a 
> script, it would look like this:
> {code:java}
> set k1=v1;
> set k2=v2;
> set ...
> INSERT INTO TABLE t1
> SELECT ...
> {code}
>  
>  It's hard to find the configs used by a SQL query without the raw SQL string. 
> The current UI provides an `Environment` tab, but it only shows some global 
> initial configs, which is not enough.
> Some use cases:
>  * Jar-based jobs, where we might set configs many times across many SQL executions.
>  * SQL servers (e.g. SparkThriftServer), where we might execute thousands of scripts 
> every day with different sessions.
> We expect a feature that can list the modified configs which could affect the 
> SQL execution.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34737) Discrepancy between TIMESTAMP_SECONDS and cast from float

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301076#comment-17301076
 ] 

Apache Spark commented on SPARK-34737:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31831

> Discrepancy between TIMESTAMP_SECONDS and cast from float
> -
>
> Key: SPARK-34737
> URL: https://issues.apache.org/jira/browse/SPARK-34737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> TIMESTAMP_SECONDS and casting a float to TIMESTAMP give different results:
> {code:sql}
> spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
> 1970-07-14 07:20:15
> spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
> 1970-07-14 07:20:14.951424
> {code}
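One plausible explanation, shown as a plain float-arithmetic sketch rather than the exact internal code path: scaling the seconds to microseconds while the value is still in single precision rounds away the last few digits.

{code:scala}
// Sketch only: shows how single-precision rounding can produce the observed value.
// This illustrates float arithmetic, not necessarily the exact implementation inside Spark.
val seconds: Float = 16777215.0f

// Scale to microseconds while still a Float, then widen to Double:
val scaledAsFloat  = (seconds * 1000000L).toDouble  // 1.6777214951424E13 -> 1970-07-14 07:20:14.951424
// Widen to Double first, then scale:
val scaledAsDouble = seconds.toDouble * 1000000L    // 1.6777215E13       -> 1970-07-14 07:20:15

println(scaledAsFloat)
println(scaledAsDouble)
{code}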



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34737) Discrepancy between TIMESTAMP_SECONDS and cast from float

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34737:


Assignee: Apache Spark

> Discrepancy between TIMESTAMP_SECONDS and cast from float
> -
>
> Key: SPARK-34737
> URL: https://issues.apache.org/jira/browse/SPARK-34737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> TIMESTAMP_SECONDS and casting a float to TIMESTAMP give different results:
> {code:sql}
> spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
> 1970-07-14 07:20:15
> spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
> 1970-07-14 07:20:14.951424
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34737) Discrepancy between TIMESTAMP_SECONDS and cast from float

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34737:


Assignee: (was: Apache Spark)

> Discrepancy between TIMESTAMP_SECONDS and cast from float
> -
>
> Key: SPARK-34737
> URL: https://issues.apache.org/jira/browse/SPARK-34737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> TIMESTAMP_SECONDS and casting a float to TIMESTAMP give different results:
> {code:sql}
> spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
> 1970-07-14 07:20:15
> spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
> 1970-07-14 07:20:14.951424
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34737) Discrepancy between TIMESTAMP_SECONDS and cast from float

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301077#comment-17301077
 ] 

Apache Spark commented on SPARK-34737:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31831

> Discrepancy between TIMESTAMP_SECONDS and cast from float
> -
>
> Key: SPARK-34737
> URL: https://issues.apache.org/jira/browse/SPARK-34737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> TIMESTAMP_SECONDS and casting a float to TIMESTAMP give different results:
> {code:sql}
> spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
> 1970-07-14 07:20:15
> spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
> 1970-07-14 07:20:14.951424
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-14 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34738:
--

 Summary: Upgrade Minikube and kubernetes cluster version on Jenkins
 Key: SPARK-34738
 URL: https://issues.apache.org/jira/browse/SPARK-34738
 Project: Spark
  Issue Type: Task
  Components: jenkins, Kubernetes
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


[~shaneknapp] as we discussed [on the mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html], 
Minikube can be upgraded to the latest version (v1.18.1), and the Kubernetes version should 
be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).

[Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new 
method to configure the Kubernetes client. It would be great if you could use it for 
testing on Jenkins after the Minikube version is updated.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34739) Add a year-month interval to a timestamp

2021-03-14 Thread Max Gekk (Jira)
Max Gekk created SPARK-34739:


 Summary: Add a year-month interval to a timestamp
 Key: SPARK-34739
 URL: https://issues.apache.org/jira/browse/SPARK-34739
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.2.0


Support adding of YearMonthIntervalType values to DATE values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34739) Add a year-month interval to a timestamp

2021-03-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-34739:
-
Description: Support adding of YearMonthIntervalType values to TIMESTAMP 
values.  (was: Support adding of YearMonthIntervalType values to DATE values.)

> Add a year-month interval to a timestamp
> -
>
> Key: SPARK-34739
> URL: https://issues.apache.org/jira/browse/SPARK-34739
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support adding of YearMonthIntervalType values to TIMESTAMP values.
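For illustration only (assuming the ANSI year-month interval literal syntax; the example is not taken from the ticket), the sub-task targets queries of this shape:

{code:scala}
// Illustrative sketch, assuming ANSI year-month interval literals: adding a
// YearMonthIntervalType value to a TIMESTAMP should shift it by whole years and months.
spark.sql("SELECT timestamp'2021-01-01 12:00:00' + INTERVAL '1-1' YEAR TO MONTH").show()
// expected: 2022-02-01 12:00:00 (one year and one month later)
{code}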



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32082) Project Zen: Improving Python usability

2021-03-14 Thread Danny Meijer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301176#comment-17301176
 ] 

Danny Meijer commented on SPARK-32082:
--

That's true - the first time around it will be a HUGE commit (as quite a lot of 
the files will change). I ran a quick dry run to judge how much would change:

183 files would be reformatted, 91 files would be left unchanged.

> Project Zen: Improving Python usability
> ---
>
> Key: SPARK-32082
> URL: https://issues.apache.org/jira/browse/SPARK-32082
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> The importance of Python and PySpark has grown radically in the last few 
> years. The number of PySpark downloads reached [more than 1.3 million _every 
> week_|https://pypistats.org/packages/pyspark] when we count them _only_ in 
> PyPI. Nevertheless, PySpark is still not very Pythonic: for example, it exposes many 
> raw JVM error messages, and the API documentation is poorly written.
> This epic ticket aims to improve the usability of PySpark and make it more 
> Pythonic. To be more explicit, this JIRA targets the four bullet points below. 
> Each includes examples:
>  * Being Pythonic
>  ** Pandas UDF enhancements and type hints
>  ** Avoid dynamic function definitions, for example, in {{functions.py}}, 
> which IDEs are unable to detect.
>  * Better and easier usability in PySpark
>  ** User-facing error message and warnings
>  ** Documentation
>  ** User guide
>  ** Better examples and API documentation, e.g. 
> [Koalas|https://koalas.readthedocs.io/en/latest/] and 
> [pandas|https://pandas.pydata.org/docs/]
>  * Better interoperability with other Python libraries
>  ** Visualization and plotting
>  ** Potentially better interface by leveraging Arrow
>  ** Compatibility with other libraries such as NumPy universal functions or 
> pandas possibly by leveraging Koalas
>  * PyPI Installation
>  ** PySpark with Hadoop 3 support on PyPi
>  ** Better error handling



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34739) Add a year-month interval to a timestamp

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301266#comment-17301266
 ] 

Apache Spark commented on SPARK-34739:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31832

> Add a year-month interval to a timestamp
> -
>
> Key: SPARK-34739
> URL: https://issues.apache.org/jira/browse/SPARK-34739
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support adding of YearMonthIntervalType values to TIMESTAMP values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34739) Add a year-month interval to a timestamp

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34739:


Assignee: Max Gekk  (was: Apache Spark)

> Add a year-month interval to a timestamp
> -
>
> Key: SPARK-34739
> URL: https://issues.apache.org/jira/browse/SPARK-34739
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support adding of YearMonthIntervalType values to TIMESTAMP values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34739) Add a year-month interval to a timestamp

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34739:


Assignee: Apache Spark  (was: Max Gekk)

> Add a year-month interval to a timestamp
> -
>
> Key: SPARK-34739
> URL: https://issues.apache.org/jira/browse/SPARK-34739
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Support adding of YearMonthIntervalType values to TIMESTAMP values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31427) Spark Structured Streaming reads data twice per micro-batch.

2021-03-14 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301281#comment-17301281
 ] 

Nick Hryhoriev commented on SPARK-31427:


Hi [~xccui], after some deeper investigation, I can add:

There is no issue with the `repartitionByRange` SQL query.
Spark generates 2 or 3 Spark jobs, and this is expected behavior:
1. One job for sampling.
2. One job for the actual repartition.
3. Only when using `show` (which is effectively a `take`), Spark creates one more job, 
but with the first stage skipped. I haven't found out why, but in any case it's not related 
to this issue.

There is no issue with the `sort` + `show` (`take`) SQL query either: Spark generates 1 
Spark job, and that's fine.
While `sort` still uses RangePartitioner, the SQL query planner optimizes `sort` + 
`take` into `*TakeOrderedAndProject*`.
And if we change `show` to `.foreach(x => println(x))`, Spark generates two jobs.

But what I really don't understand is: 
why the `sortWithinPartitions` + `show` (`take`) SQL query makes Spark generate two 
Spark jobs.
This is the only open question in this issue. From my point of view, it looks like 
an issue in the Spark SQL query planner.
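One way to see the `sort` + `take` optimization mentioned above is to check the physical plan of the equivalent batch query. A small illustrative sketch (the exact plan text varies by Spark version):

{code:scala}
// Small sketch: a sorted + limited batch query is planned as TakeOrderedAndProject,
// which is why only one job is needed. The plan text below is approximate.
import org.apache.spark.sql.functions.col

val df = spark.range(0, 1000).toDF("offset")
df.sort(col("offset")).limit(20).explain()
// == Physical Plan ==
// TakeOrderedAndProject(limit=20, orderBy=[offset ASC NULLS FIRST], output=[offset])
// +- ...
{code}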

> Spark Structured Streaming reads data twice per micro-batch.
> 
>
> Key: SPARK-31427
> URL: https://issues.apache.org/jira/browse/SPARK-31427
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.7, 3.0.1, 3.0.2
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I have a very strange issue with Spark Structured Streaming. Spark Structured 
> Streaming creates two Spark jobs for every micro-batch and, as a result, reads 
> data from Kafka twice. Here is a simple code snippet.
>  
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> import scala.concurrent.duration._
> object CheckHowSparkReadFromKafka {
>   def main(args: Array[String]): Unit = {
> val session = SparkSession.builder()
>   .config(new SparkConf()
> .setAppName(s"simple read from kafka with repartition")
> .setMaster("local[*]")
> .set("spark.driver.host", "localhost"))
>   .getOrCreate()
> val testPath = "/tmp/spark-test"
> FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new 
> Path(testPath), true)
> import session.implicits._
> val stream = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers","kafka-20002-prod:9092")
>   .option("subscribe", "topic")
>   .option("maxOffsetsPerTrigger", 1000)
>   .option("failOnDataLoss", false)
>   .option("startingOffsets", "latest")
>   .load()
>   .repartitionByRange( $"offset")
>   .writeStream
>   .option("path", testPath + "/data")
>   .option("checkpointLocation", testPath + "/checkpoint")
>   .format("parquet")
>   .trigger(Trigger.ProcessingTime(10.seconds))
>   .start()
> stream.processAllAvailable()
>   }
> }
> {code}
> This happens because of {{.repartitionByRange( $"offset")}}: if I remove this 
> line, everything is fine. But with it, Spark creates two jobs, one with 1 stage that just reads 
> from Kafka, and a second with 3 stages (read -> shuffle -> write). So the result 
> of the first job is never used.
> This has a significant impact on performance. Some of my Kafka topics have 
> 1550 partitions, so reading them twice is a big deal. If I add a cache, 
> things get better, but that is not an option for me. In local mode, the first 
> job in each batch takes less than 0.1 ms, except for the batch with index 0. But on YARN 
> and Mesos clusters both jobs run fully and on my topics take nearly 1.2 
> min.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31427) Spark Structured Streaming reads data twice per micro-batch.

2021-03-14 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301281#comment-17301281
 ] 

Nick Hryhoriev edited comment on SPARK-31427 at 3/14/21, 9:26 PM:
--

Hi [~xccui], after some deeper investigation, I can add:

There is no issue with the `repartitionByRange` SQL query.
Spark generates 2 or 3 Spark jobs, and this is expected behavior:
1. One job for sampling.
2. One job for the actual repartition.
3. Only when using `show` (which is effectively a `take`), Spark creates one more job, 
but with the first stage skipped. I haven't found out why, but in any case it's not related 
to this issue.

There is no issue with the `sort` + `show` (`take`) SQL query either: Spark generates 1 
Spark job, and that's fine.
While `sort` still uses RangePartitioner, the SQL query planner optimizes `sort` + 
`take` into `*TakeOrderedAndProject*`.
And if we change `show` to `.foreach(x => println(x))`, Spark generates two jobs.

But what I really don't understand is: 
why the `sortWithinPartitions` + `show` (`take`) SQL query makes Spark generate two 
Spark jobs.
This is the only open question in this issue. From my point of view, it looks like 
an issue in the Spark SQL query planner.


was (Author: hryhoriev.nick):
hi [~xccui] , after some deep investigation, I can add:

There is no issue with the `reaprtitionByRange` SQL query.
Spark generate 2 or 3 spark job, and this is expected to behave.
1. For a sampling
2. For actual repartition
3. Only in case using `show`, which actually `take` spark create one more job 
but with the skipped first stage. I don't find why nut any way it's not related 
to this issue.

There is no issue with the  `sort` + `show`(`take`) SQL query, spark generate 1 
spark job, and it's ok.
While `sort` still uses RangePartitioner, SqlQuery planner optimizes `sort` + 
`take` into `*TakeOrderedAndProject*`.
And If we will change `show` to `.foreach(x => println(x))` spark will generate 
two query.

But what I really don't understand is: 
Why in case`sortWithinPartitions`+ `show`(`take`)SQL query, spark generate two 
spark job.
This is the only open question in this issue, From my point of view, looks like 
an issue in the spark SQL query planner.

> Spark Structured Streaming reads data twice per micro-batch.
> 
>
> Key: SPARK-31427
> URL: https://issues.apache.org/jira/browse/SPARK-31427
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.7, 3.0.1, 3.0.2
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I have a very strange issue with Spark Structured Streaming. Spark Structured 
> Streaming creates two Spark jobs for every micro-batch and, as a result, reads 
> data from Kafka twice. Here is a simple code snippet.
>  
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> import scala.concurrent.duration._
> object CheckHowSparkReadFromKafka {
>   def main(args: Array[String]): Unit = {
> val session = SparkSession.builder()
>   .config(new SparkConf()
> .setAppName(s"simple read from kafka with repartition")
> .setMaster("local[*]")
> .set("spark.driver.host", "localhost"))
>   .getOrCreate()
> val testPath = "/tmp/spark-test"
> FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new 
> Path(testPath), true)
> import session.implicits._
> val stream = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers","kafka-20002-prod:9092")
>   .option("subscribe", "topic")
>   .option("maxOffsetsPerTrigger", 1000)
>   .option("failOnDataLoss", false)
>   .option("startingOffsets", "latest")
>   .load()
>   .repartitionByRange( $"offset")
>   .writeStream
>   .option("path", testPath + "/data")
>   .option("checkpointLocation", testPath + "/checkpoint")
>   .format("parquet")
>   .trigger(Trigger.ProcessingTime(10.seconds))
>   .start()
> stream.processAllAvailable()
>   }
> }
> {code}
> This happens because of {{.repartitionByRange( $"offset")}}: if I remove this 
> line, everything is fine. But with it, Spark creates two jobs, one with 1 stage that just reads 
> from Kafka, and a second with 3 stages (read -> shuffle -> write). So the result 
> of the first job is never used.
> This has a significant impact on performance. Some of my Kafka topics have 
> 1550 partitions, so reading them twice is a big deal. If I add a cache, 
> things get better, but that is not an option for me. In local mode, the first 
> job in each batch takes less than 0.1 ms, except for the batch with index 0. But on YARN 
> and Mesos clusters both jobs run fully and on my topics take nearly 1.2 
> min.
>  
>  



--
This message was 

[jira] [Commented] (SPARK-31427) Spark Structured Streaming reads data twice per micro-batch.

2021-03-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301291#comment-17301291
 ] 

Jungtaek Lim commented on SPARK-31427:
--

If you have to use foreachBatch to enable specific operations, you're already 
aware that these operations are not available in a streaming query.
(For example, the "sort" operation doesn't make sense on "unbounded" data. Also, 
you're encouraged to use the "console sink" instead of show for a streaming query.)

foreachBatch turns the dataset of each micro-batch of the streaming query into a "batch" 
dataset, and you're free to apply operations to that "batch" dataset, though 
further operations may bring additional jobs.

If your only concern is that the Kafka data is read again by the second job, please try 
"persist" on the dataset parameter you get from foreachBatch, and "unpersist" 
when you are done with the dataset.

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
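A minimal sketch of the persist/unpersist pattern described above (broker address, topic name, and paths are placeholders, not values from this ticket):

{code:scala}
// Minimal sketch of the suggested pattern: persist the micro-batch once inside foreachBatch
// so additional actions reuse the cached data instead of re-reading Kafka.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

object ForeachBatchPersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("foreachBatch-persist").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-20002-prod:9092")
      .option("subscribe", "topic")
      .load()

    val query = input.writeStream
      .trigger(Trigger.ProcessingTime(10.seconds))
      .option("checkpointLocation", "/tmp/spark-test/checkpoint")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.persist()                                    // materialize the micro-batch once
        batch.sortWithinPartitions("offset")               // further operations reuse the cache
          .write.mode("append").parquet(s"/tmp/spark-test/data/batch=$batchId")
        batch.show(10, truncate = false)
        batch.unpersist()                                  // release the cache when done
      }
      .start()

    query.processAllAvailable()
  }
}
{code}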


> Spark Structured Streaming reads data twice per micro-batch.
> 
>
> Key: SPARK-31427
> URL: https://issues.apache.org/jira/browse/SPARK-31427
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.7, 3.0.1, 3.0.2
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I have a very strange issue with Spark Structured Streaming. Spark Structured 
> Streaming creates two Spark jobs for every micro-batch and, as a result, reads 
> data from Kafka twice. Here is a simple code snippet.
>  
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> import scala.concurrent.duration._
> object CheckHowSparkReadFromKafka {
>   def main(args: Array[String]): Unit = {
> val session = SparkSession.builder()
>   .config(new SparkConf()
> .setAppName(s"simple read from kafka with repartition")
> .setMaster("local[*]")
> .set("spark.driver.host", "localhost"))
>   .getOrCreate()
> val testPath = "/tmp/spark-test"
> FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new 
> Path(testPath), true)
> import session.implicits._
> val stream = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers","kafka-20002-prod:9092")
>   .option("subscribe", "topic")
>   .option("maxOffsetsPerTrigger", 1000)
>   .option("failOnDataLoss", false)
>   .option("startingOffsets", "latest")
>   .load()
>   .repartitionByRange( $"offset")
>   .writeStream
>   .option("path", testPath + "/data")
>   .option("checkpointLocation", testPath + "/checkpoint")
>   .format("parquet")
>   .trigger(Trigger.ProcessingTime(10.seconds))
>   .start()
> stream.processAllAvailable()
>   }
> }
> {code}
> This happens because of {{.repartitionByRange( $"offset")}}: if I remove this 
> line, everything is fine. But with it, Spark creates two jobs, one with 1 stage that just reads 
> from Kafka, and a second with 3 stages (read -> shuffle -> write). So the result 
> of the first job is never used.
> This has a significant impact on performance. Some of my Kafka topics have 
> 1550 partitions, so reading them twice is a big deal. If I add a cache, 
> things get better, but that is not an option for me. In local mode, the first 
> job in each batch takes less than 0.1 ms, except for the batch with index 0. But on YARN 
> and Mesos clusters both jobs run fully and on my topics take nearly 1.2 
> min.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34737) Discrepancy between TIMESTAMP_SECONDS and cast from float

2021-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34737:


Assignee: Max Gekk

> Discrepancy between TIMESTAMP_SECONDS and cast from float
> -
>
> Key: SPARK-34737
> URL: https://issues.apache.org/jira/browse/SPARK-34737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> TIMESTAMP_SECONDS and casting a float to TIMESTAMP give different results:
> {code:sql}
> spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
> 1970-07-14 07:20:15
> spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
> 1970-07-14 07:20:14.951424
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34737) Discrepancy between TIMESTAMP_SECONDS and cast from float

2021-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34737.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31831
[https://github.com/apache/spark/pull/31831]

> Discrepancy between TIMESTAMP_SECONDS and cast from float
> -
>
> Key: SPARK-34737
> URL: https://issues.apache.org/jira/browse/SPARK-34737
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> TIMESTAMP_SECONDS and casting a float to TIMESTAMP give different results:
> {code:sql}
> spark-sql> SELECT CAST(16777215.0f AS TIMESTAMP);
> 1970-07-14 07:20:15
> spark-sql> SELECT TIMESTAMP_SECONDS(16777215.0f);
> 1970-07-14 07:20:14.951424
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21449) Hive client's SessionState was not closed properly in HiveExternalCatalog

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301332#comment-17301332
 ] 

Apache Spark commented on SPARK-21449:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31833

> Hive client's SessionState was not closed properly  in HiveExternalCatalog
> --
>
> Key: SPARK-21449
> URL: https://issues.apache.org/jira/browse/SPARK-21449
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: bulk-closed
>
> Close the SessionState to clear `hive.downloaded.resources.dir` and other resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301334#comment-17301334
 ] 

Apache Spark commented on SPARK-23745:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31833

> Remove the directories of the “hive.downloaded.resources.dir” when 
> HiveThriftServer2 stopped
> 
>
> Key: SPARK-23745
> URL: https://issues.apache.org/jira/browse/SPARK-23745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: linux
>Reporter: zuotingbing
>Priority: Major
>  Labels: bulk-closed
> Attachments: 2018-03-20_164832.png
>
>
> !2018-03-20_164832.png!  
> When we start HiveThriftServer2, we create some directories for 
> hive.downloaded.resources.dir, but when we stop HiveThriftServer2 we do not 
> remove these directories. These directories can accumulate over time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301335#comment-17301335
 ] 

Apache Spark commented on SPARK-23745:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31833

> Remove the directories of the “hive.downloaded.resources.dir” when 
> HiveThriftServer2 stopped
> 
>
> Key: SPARK-23745
> URL: https://issues.apache.org/jira/browse/SPARK-23745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: linux
>Reporter: zuotingbing
>Priority: Major
>  Labels: bulk-closed
> Attachments: 2018-03-20_164832.png
>
>
> !2018-03-20_164832.png!  
> When we start HiveThriftServer2, we create some directories for 
> hive.downloaded.resources.dir, but when we stop HiveThriftServer2 we do not 
> remove these directories. These directories can accumulate over time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34740) can class pyspark.mllib.evaluation.RegressionMetrics() accept DataFrame as the parameter?

2021-03-14 Thread xifeng (Jira)
xifeng created SPARK-34740:
--

 Summary: can class pyspark.mllib.evaluation.RegressionMetrics() 
accept DataFrame as the parameter?
 Key: SPARK-34740
 URL: https://issues.apache.org/jira/browse/SPARK-34740
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.4.6
Reporter: xifeng


class pyspark.mllib.evaluation.RegressionMetrics(predictionAndObservations) can 
only accept an RDD of (prediction, observation) pairs as the parameter.

Can it also accept a DataFrame as the parameter? That would be much more 
convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34740) can class pyspark.mllib.evaluation.RegressionMetrics() accept DataFrame as the parameter?

2021-03-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301371#comment-17301371
 ] 

Hyukjin Kwon commented on SPARK-34740:
--

pyspark.mllib is for RDDs only. This should be implemented in pyspark.ml instead.
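In the meantime, a common workaround is to build the RDD of (prediction, observation) pairs from the DataFrame yourself. A small Scala sketch with placeholder column names and values (the PySpark equivalent is analogous):

{code:scala}
// Workaround sketch: build the (prediction, observation) RDD that mllib's
// RegressionMetrics expects from a DataFrame. Column names/values are placeholders.
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("regression-metrics").getOrCreate()

val df = spark.createDataFrame(Seq(
  (2.5, 3.0),
  (0.0, -0.5),
  (2.0, 2.0)
)).toDF("prediction", "label")

val predictionAndObservations = df
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new RegressionMetrics(predictionAndObservations)
println(s"RMSE = ${metrics.rootMeanSquaredError}")
{code}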

> can class pyspark.mllib.evaluation.RegressionMetrics() accept DataFrame as 
> the parameter?
> -
>
> Key: SPARK-34740
> URL: https://issues.apache.org/jira/browse/SPARK-34740
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.6
>Reporter: xifeng
>Priority: Minor
>
> class pyspark.mllib.evaluation.RegressionMetrics(predictionAndObservations) 
> can only accept an RDD of (prediction, observation) pairs as the parameter.
> Can it also accept a DataFrame as the parameter? That would be much more 
> convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34740) can class pyspark.mllib.evaluation.RegressionMetrics() accept DataFrame as the parameter?

2021-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34740.
--
Resolution: Invalid

> can class pyspark.mllib.evaluation.RegressionMetrics() accept DataFrame as 
> the parameter?
> -
>
> Key: SPARK-34740
> URL: https://issues.apache.org/jira/browse/SPARK-34740
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.6
>Reporter: xifeng
>Priority: Minor
>
> class pyspark.mllib.evaluation.RegressionMetrics(predictionAndObservations) 
> can only accept an RDD of (prediction, observation) pairs as the parameter.
> Can it also accept a DataFrame as the parameter? That would be much more 
> convenient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34733) Spark UI not showing memory used of partitions in memory

2021-03-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301372#comment-17301372
 ] 

Hyukjin Kwon commented on SPARK-34733:
--

[~sams] can you confirm if this happens in Apache Spark instead of EMR Spark?

> Spark UI not showing memory used of partitions in memory
> 
>
> Key: SPARK-34733
> URL: https://issues.apache.org/jira/browse/SPARK-34733
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: EMR AWS emr-6.0.0
>Reporter: sam
>Priority: Major
> Attachments: Screenshot 2021-03-13 at 16.31.06.png
>
>
> We have a job that caches RDDs in memory. We know the caching code is 
> working, as the Spark logs show that the caching is happening:
> ```
> 21/03/13 16:17:38 INFO BlockManagerInfo: *Added rdd_12_413 in memory* on 
> ip-172-31-24-152.eu-west-2.compute.internal:43849 *(size: 201.4 MB, free: 
> 575.0 GB)*
> 21/03/13 16:17:38 INFO TaskSetManager: Starting task 897.0 in stage 1.0 (TID 
> 10277, ip-172-31-24-152.eu-west-2.compute.internal, executor 8, partition 
> 897, RACK_LOCAL, 9812 bytes)
> 21/03/13 16:17:38 INFO TaskSetManager: Finished task 413.0 in stage 1.0 (TID 
> 9793) in 1250463 ms on ip-172-31-24-152.eu-west-2.compute.internal (executor 
> 8) (34/2162)
> 21/03/13 16:17:42 INFO BlockManagerInfo: Added rdd_12_768 in memory on 
> ip-172-31-23-154.eu-west-2.compute.internal:37957 (size: 718.4 MB, free: 
> 574.6 GB)
> 21/03/13 16:17:43 INFO TaskSetManager: Starting task 898.0 in stage 1.0 (TID 
> 10278, ip-172-31-23-154.eu-west-2.compute.internal, executor 7, partition 
> 898, RACK_LOCAL, 9841 bytes)
> 21/03/13 16:17:43 INFO TaskSetManager: Finished task 768.0 in stage 1.0 (TID 
> 10148) in 1254945 ms on ip-172-31-23-154.eu-west-2.compute.internal (executor 
> 7) (35/2162)
> 21/03/13 16:18:44 INFO BlockManagerInfo: Added rdd_12_409 in memory on 
> ip-172-31-21-66.eu-west-2.compute.internal:38921 (size: 177.6 MB, free: 575.1 
> GB)
> 21/03/13 16:18:45 INFO TaskSetManager: Starting task 899.0 in stage 1.0 (TID 
> 10279, ip-172-31-21-66.eu-west-2.compute.internal, executor 4, partition 899, 
> RACK_LOCAL, 9828 bytes)
> 21/03/13 16:18:45 INFO TaskSetManager: Finished task 409.0 in stage 1.0 (TID 
> 9789) in 1316584 ms on ip-172-31-21-66.eu-west-2.compute.internal (executor 
> 4) (36/2162)
> 21/03/13 16:19:40 INFO BlockManagerInfo: Added rdd_12_400 in memory on 
> ip-172-31-21-66.eu-west-2.compute.internal:38921 (size: 187.9 MB, free: 574.9 
> GB)
> 21/03/13 16:19:41 INFO TaskSetManager: Starting task 900.0 in stage 1.0 (TID 
> 10280, ip-172-31-21-66.eu-west-2.compute.internal, executor 4, partition 900, 
> RACK_LOCAL, 9843 bytes)
> 21/03/13 16:19:41 INFO TaskSetManager: Finished task 400.0 in stage 1.0 (TID 
> 9780) in 1372717 ms on ip-172-31-21-66.eu-west-2.compute.internal (executor 
> 4) (37/2162)
> 21/03/13 16:20:55 INFO BlockManagerInfo: Added rdd_12_640 in memory on 
> ip-172-31-17-157.eu-west-2.compute.internal:34005 (size: 576.1 MB, free: 
> 574.7 GB)
> 21/03/13 16:20:58 INFO TaskSetManager: Starting task 901.0 in stage 1.0 (TID 
> 10281, ip-172-31-17-157.eu-west-2.compute.internal, executor 9, partition 
> 901, RACK_LOCAL, 9750 bytes)
> 21/03/13 16:20:58 INFO TaskSetManager: Finished task 640.0 in stage 1.0 (TID 
> 10020) in 1449618 ms on ip-172-31-17-157.eu-west-2.compute.internal (executor 
> 9) (38/2162)
> 21/03/13 16:21:07 INFO BlockManagerInfo: Added rdd_12_610 in memory on 
> ip-172-31-30-188.eu-west-2.compute.internal:38111 (size: 582.2 MB, free: 
> 574.7 GB)
> ```
> But when we look at the Spark UI Executors tab, it shows 0 B used of the 
> maximum. Please see the screenshot:
>  !Screenshot 2021-03-13 at 16.31.06.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34714) collect_list(struct()) fails when used with GROUP BY

2021-03-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301373#comment-17301373
 ] 

Hyukjin Kwon commented on SPARK-34714:
--

[~laurikoobas] can you provide a self-contained reproducer?

> collect_list(struct()) fails when used with GROUP BY
> 
>
> Key: SPARK-34714
> URL: https://issues.apache.org/jira/browse/SPARK-34714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Databricks Runtime 8.0
>Reporter: Lauri Koobas
>Priority: Major
>
> The following is failing in DBR8.0 / Spark 3.1.1, but works in earlier DBR 
> and Spark versions:
> {quote}with step_1 as (
>     select 'E' as name, named_struct('subfield', 1) as field_1
> )
> select name, collect_list(struct(field_1.subfield))
> from step_1
> group by 1
> {quote}
> Fails with the following error message:
> {quote}AnalysisException: cannot resolve 
> 'struct(step_1.`field_1`.`subfield`)' due to data type mismatch: Only 
> foldable string expressions are allowed to appear at odd position, got: 
> NamePlaceholder
> {quote}
> If you modify the query in any of the following ways, then it still works:
>  * if you remove the field "name" and the "group by 1" part of the query
>  * if you remove the "struct()" from within the collect_list()
>  * if you use "named_struct()" instead of "struct()" within the collect_list()
> Similarly, collect_set() is broken, and possibly more related functions as well, but I 
> haven't done thorough testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34694) Improve Spark SQL Source Filter to allow pushdown of filters span multiple columns

2021-03-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301375#comment-17301375
 ] 

Hyukjin Kwon commented on SPARK-34694:
--

{code}
(l_commitdate#11 < l_receiptdate#12)
(l_shipdate#10 < l_commitdate#11)
{code}

will be

{{And(LessThan(l_commitdate, l_receiptdate), LessThan(l_shipdate, 
l_commitdate))}}. I think this seems fine. Do you mind elaborating on what kind 
of design you have in mind?

> Improve Spark SQL Source Filter to allow pushdown of filters span multiple 
> columns
> --
>
> Key: SPARK-34694
> URL: https://issues.apache.org/jira/browse/SPARK-34694
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Chen Zou
>Priority: Minor
>
> The current org.apache.spark.sql.sources.Filter abstract class only allows 
> pushdown of filters on a single column, or sums of products of such 
> single-column filters.
> Filters that span multiple columns cannot be pushed down to the source through 
> subclasses of this Filter class, e.g. from the TPC-H benchmark on the lineitem table:
> (l_commitdate#11 < l_receiptdate#12)
> (l_shipdate#10 < l_commitdate#11)
>  
> The current design probably stems from the fact that columnar sources 
> have a hard time supporting such cross-column filters. But with batching 
> implemented in columnar sources, they can still support cross-column filters. 
>  This issue tries to open up a discussion on a more general Filter interface that 
> allows pushing down cross-column filters.
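To make the limitation concrete, here is a small usage sketch of the existing source filter classes (the date literal is a placeholder):

{code:scala}
// Illustration of the limitation described above: each existing source filter pairs a
// column name with a *literal* value, so a column-to-column comparison has no representation.
import java.sql.Date
import org.apache.spark.sql.sources.{Filter, LessThan}

// l_shipdate < DATE '1994-01-01'  -- expressible: one column compared with a literal value.
val singleColumn: Filter = LessThan("l_shipdate", Date.valueOf("1994-01-01"))

// l_shipdate < l_commitdate  -- not expressible: LessThan's second argument is a value,
// not another attribute, so this predicate cannot be pushed down through this API today.
{code}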



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34693) Support null in conversions to and from Arrow

2021-03-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301376#comment-17301376
 ] 

Hyukjin Kwon commented on SPARK-34693:
--

Which code did you run? It would be great if you could provide a reproducer.

> Support null in conversions to and from Arrow
> -
>
> Key: SPARK-34693
> URL: https://issues.apache.org/jira/browse/SPARK-34693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Laurens
>Priority: Minor
>
> Looks like a regression from or related to SPARK-33489.
> I got the following error running Spark 3.1.1 with Koalas 1.7.0:
> {code:java}
> TypeError Traceback (most recent call last)
> /var/scratch/miniconda3/lib/python3.8/site-packages/pyspark/sql/udf.py in 
> returnType(self)
> 100 try:
> --> 101 to_arrow_type(self._returnType_placeholder)
> 102 except TypeError:
> /var/scratch/miniconda3/lib/python3.8/site-packages/pyspark/sql/pandas/types.py
>  in to_arrow_type(dt)
>  75 else:
> ---> 76 raise TypeError("Unsupported type in conversion to Arrow: " + 
> str(dt))
>  77 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: NullType
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-14 Thread wuyi (Jira)
wuyi created SPARK-34741:


 Summary: MergeIntoTable should avoid ambiguous reference
 Key: SPARK-34741
 URL: https://issues.apache.org/jira/browse/SPARK-34741
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
Reporter: wuyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-14 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-34741:
-
Description: When resolving the {{UpdateAction}}, which could reference 
attributes from both target and source tables, Spark should know clearly where 
the attribute comes from when there are conflicting attributes, instead of 
picking an arbitrary one.
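A hypothetical illustration of such an ambiguity (catalog, table, and column names are placeholders, not taken from this ticket):

{code:scala}
// Hypothetical illustration of the ambiguity described above: both sides of the MERGE
// have a column named `price`, so the unqualified `price` on the right-hand side of the
// UPDATE action could resolve to either table. Catalog/table names are placeholders.
spark.sql(
  """
    |MERGE INTO testcat.ns.target AS t
    |USING testcat.ns.source AS s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET t.price = price + 1  -- `price` exists in both t and s
    |""".stripMargin)
{code}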

> MergeIntoTable should avoid ambiguous reference
> ---
>
> Key: SPARK-34741
> URL: https://issues.apache.org/jira/browse/SPARK-34741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: wuyi
>Priority: Major
>
> When resolving the {{UpdateAction}}, which could reference attributes from 
> both target and source tables, Spark should know clearly where the attribute 
> comes from when there are conflicting attributes, instead of picking an 
> arbitrary one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34741:


Assignee: (was: Apache Spark)

> MergeIntoTable should avoid ambiguous reference
> ---
>
> Key: SPARK-34741
> URL: https://issues.apache.org/jira/browse/SPARK-34741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: wuyi
>Priority: Major
>
> When resolving the {{UpdateAction}}, which could reference attributes from 
> both target and source tables, Spark should know clearly where the attribute 
> comes from when there are conflicting attributes, instead of picking an 
> arbitrary one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34741:


Assignee: Apache Spark

> MergeIntoTable should avoid ambiguous reference
> ---
>
> Key: SPARK-34741
> URL: https://issues.apache.org/jira/browse/SPARK-34741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> When resolving the {{UpdateAction}}, which could reference attributes from 
> both target and source tables, Spark should know clearly where the attribute 
> comes from when there are conflicting attributes, instead of picking an 
> arbitrary one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301396#comment-17301396
 ] 

Apache Spark commented on SPARK-34741:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31835

> MergeIntoTable should avoid ambiguous reference
> ---
>
> Key: SPARK-34741
> URL: https://issues.apache.org/jira/browse/SPARK-34741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: wuyi
>Priority: Major
>
> When resolving the {{UpdateAction}}, which could reference attributes from 
> both target and source tables, Spark should know clearly where the attribute 
> comes from when there are conflicting attributes, instead of picking an 
> arbitrary one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34714) collect_list(struct()) fails when used with GROUP BY

2021-03-14 Thread Lauri Koobas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301403#comment-17301403
 ] 

Lauri Koobas commented on SPARK-34714:
--

Didn't I? When I copy-paste the only code example from the ticket description 
into my Databricks cluster running DBR 8.0, it produces the error 
message.
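For convenience, here is the ticket's query wrapped so it can be run against plain Spark (e.g. in spark-shell) without Databricks; the quoted error is the one reported in the description:

{code:scala}
// The query from the ticket description, wrapped so it can be run in spark-shell or a
// small app against plain Spark 3.1.1 (no Databricks needed).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-34714-repro").getOrCreate()

spark.sql(
  """
    |WITH step_1 AS (
    |  SELECT 'E' AS name, named_struct('subfield', 1) AS field_1
    |)
    |SELECT name, collect_list(struct(field_1.subfield))
    |FROM step_1
    |GROUP BY 1
    |""".stripMargin).show()

// Reported failure on Spark 3.1.1:
//   AnalysisException: cannot resolve 'struct(step_1.`field_1`.`subfield`)' due to data type
//   mismatch: Only foldable string expressions are allowed to appear at odd position
{code}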

> collect_list(struct()) fails when used with GROUP BY
> 
>
> Key: SPARK-34714
> URL: https://issues.apache.org/jira/browse/SPARK-34714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Databricks Runtime 8.0
>Reporter: Lauri Koobas
>Priority: Major
>
> The following is failing in DBR8.0 / Spark 3.1.1, but works in earlier DBR 
> and Spark versions:
> {quote}with step_1 as (
>     select 'E' as name, named_struct('subfield', 1) as field_1
> )
> select name, collect_list(struct(field_1.subfield))
> from step_1
> group by 1
> {quote}
> Fails with the following error message:
> {quote}AnalysisException: cannot resolve 
> 'struct(step_1.`field_1`.`subfield`)' due to data type mismatch: Only 
> foldable string expressions are allowed to appear at odd position, got: 
> NamePlaceholder
> {quote}
> If you modify the query in any of the following ways, then it still works:
>  * if you remove the field "name" and the "group by 1" part of the query
>  * if you remove the "struct()" from within the collect_list()
>  * if you use "named_struct()" instead of "struct()" within the collect_list()
> Similarly, collect_set() is broken, and possibly more related functions as well, but I 
> haven't done thorough testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34725) Do not show Statistics(sizeInBytes=8.0 EiB) if we don't have valid stats

2021-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34725.
---
Resolution: Later

The PR is closed.

> Do not show Statistics(sizeInBytes=8.0 EiB) if we don't have valid stats
> 
>
> Key: SPARK-34725
> URL: https://issues.apache.org/jira/browse/SPARK-34725
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> In cost explain mode, we might get an explain string like the following if 
> the table is external and we have no stats.
>  
> {code:java}
> == Optimized Logical Plan ==
> GlobalLimit 21, Statistics(sizeInBytes=1008.0 B, rowCount=21)
> +- LocalLimit 21, Statistics(sizeInBytes=12.0 EiB)
>    +- Project [cast(c1#52 as string) AS c1#259, pt#53], Statistics(sizeInBytes=12.0 EiB)
>       +- Relation default.pt1[c1#52,pt#53] parquet, Statistics(sizeInBytes=8.0 EiB)
> {code}
>  
> The reason is that we fall back to `Long.MaxValue` (via the conf 
> `spark.sql.defaultSizeInBytes`) as the size estimate when no statistics are 
> available. It would be better to hide this default stat.
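> For context, a sketch of how this shows up (reusing the table from the plan 
> above; any external table without computed statistics behaves the same way):
> {code:sql}
> -- With no table statistics, the relation's size estimate falls back to
> -- spark.sql.defaultSizeInBytes, which defaults to Long.MaxValue (8.0 EiB).
> EXPLAIN COST
> SELECT CAST(c1 AS STRING) AS c1, pt FROM default.pt1 LIMIT 21;
> 
> -- Computing statistics gives the relation a real size instead of the placeholder:
> ANALYZE TABLE default.pt1 COMPUTE STATISTICS;
> {code}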



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34722) Clean up deprecated API usage related to JUnit

2021-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34722:
-

Assignee: Yang Jie

> Clean up deprecated API usage related to JUnit
> --
>
> Key: SPARK-34722
> URL: https://issues.apache.org/jira/browse/SPARK-34722
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Maven compilation produces the following warnings:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/common/network-common/src/test/java/org/apache/spark/network/client/TransportClientFactorySuite.java:231:
>  org.junit.rules.ExpectedException中的none()已过时
> [WARNING] [Warn] 
> /spark/common/network-common/src/test/java/org/apache/spark/network/client/TransportClientFactorySuite.java:247:
>  org.junit.rules.ExpectedException中的none()已过时
> [WARNING] [Warn] 
> /spark/common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherSuite.java:84:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java:43:
>  org.junit.rules.ExpectedException中的none()已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java:542:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:227:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:235:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:254:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:915:
>  org.junit.rules.ExpectedException中的none()已过时
> {code}
>  
> "已过时" means  Deprecated
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34722) Clean up deprecated API usage related to JUnit

2021-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34722.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31815
[https://github.com/apache/spark/pull/31815]

> Clean up deprecated API usage related to JUnit
> --
>
> Key: SPARK-34722
> URL: https://issues.apache.org/jira/browse/SPARK-34722
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Maven compilation produces the following warnings:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/common/network-common/src/test/java/org/apache/spark/network/client/TransportClientFactorySuite.java:231:
>  org.junit.rules.ExpectedException中的none()已过时
> [WARNING] [Warn] 
> /spark/common/network-common/src/test/java/org/apache/spark/network/client/TransportClientFactorySuite.java:247:
>  org.junit.rules.ExpectedException中的none()已过时
> [WARNING] [Warn] 
> /spark/common/network-common/src/test/java/org/apache/spark/network/crypto/TransportCipherSuite.java:84:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java:43:
>  org.junit.rules.ExpectedException中的none()已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java:542:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:227:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:235:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:254:
>  org.junit.Assert中的assertThat(T,org.hamcrest.Matcher)已过时
> [WARNING] [Warn] 
> /spark/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:915:
>  org.junit.rules.ExpectedException中的none()已过时
> {code}
>  
> "已过时" means  Deprecated
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34729) Faster execution for broadcast nested loop join (left semi/anti with no condition)

2021-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34729:
-

Assignee: Cheng Su

> Faster execution for broadcast nested loop join (left semi/anti with no 
> condition)
> --
>
> Key: SPARK-34729
> URL: https://issues.apache.org/jira/browse/SPARK-34729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> For `BroadcastNestedLoopJoinExec` left semi and left anti joins without a 
> condition, when the left side is broadcast, we currently check whether every 
> row from the broadcast side has a match by iterating over the broadcast side 
> many times - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L256-L275]
>  . This is unnecessary: since there is no condition, we only need to check 
> whether the stream side is empty or not. This Jira adds that optimization, 
> which can significantly boost the performance of the affected queries.
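> One query shape that can produce this plan (hypothetical table names): an 
> uncorrelated EXISTS typically becomes a left semi join with no join 
> condition, and if the planner picks a broadcast nested loop join, it lands 
> in the code path above.
> {code:sql}
> -- Uncorrelated EXISTS: a left semi join with no condition. With the optimization,
> -- the join only needs to know whether the other side is non-empty, instead of
> -- repeatedly scanning it to match every row.
> SELECT *
> FROM big_table
> WHERE EXISTS (SELECT 1 FROM small_table);
> {code}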



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34729) Faster execution for broadcast nested loop join (left semi/anti with no condition)

2021-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34729.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31821
[https://github.com/apache/spark/pull/31821]

> Faster execution for broadcast nested loop join (left semi/anti with no 
> condition)
> --
>
> Key: SPARK-34729
> URL: https://issues.apache.org/jira/browse/SPARK-34729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> For `BroadcastNestedLoopJoinExec` left semi and left anti joins without a 
> condition, when the left side is broadcast, we currently check whether every 
> row from the broadcast side has a match by iterating over the broadcast side 
> many times - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L256-L275]
>  . This is unnecessary: since there is no condition, we only need to check 
> whether the stream side is empty or not. This Jira adds that optimization, 
> which can significantly boost the performance of the affected queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org