[jira] [Resolved] (SPARK-32802) Avoid using SpecificInternalRow in RunLengthEncoding#Encoder

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32802.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29654
[https://github.com/apache/spark/pull/29654]

> Avoid using SpecificInternalRow in RunLengthEncoding#Encoder
> 
>
> Key: SPARK-32802
> URL: https://issues.apache.org/jira/browse/SPARK-32802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.1.0
>
>
> {{RunLengthEncoding#Encoder}} currently uses {{SpecificInternalRow}}, which is 
> more expensive than using the native types.
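The overhead described here comes from routing each value through a generic row container instead of operating on primitive values directly. As a minimal illustration of the underlying technique (plain run-length encoding over native values, not Spark's actual {{Encoder}} code), a hypothetical Python sketch:

```python
def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs,
    working directly on native values with no row wrapper."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

The real encoder tracks runs per column type; the sketch only shows why no per-value row object is needed.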



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32802) Avoid using SpecificInternalRow in RunLengthEncoding#Encoder

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32802:
-

Assignee: Chao Sun

> Avoid using SpecificInternalRow in RunLengthEncoding#Encoder
> 
>
> Key: SPARK-32802
> URL: https://issues.apache.org/jira/browse/SPARK-32802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> {{RunLengthEncoding#Encoder}} currently uses {{SpecificInternalRow}}, which is 
> more expensive than using the native types.






[jira] [Updated] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32865:
--
Affects Version/s: 2.2.3

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.0, 3.0.1
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Minor
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}
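For context, Jekyll/Liquid substitutes a site variable only inside double curly braces; a single-brace form is emitted literally into the page. A sketch of the intended usage (the surrounding sentence is illustrative, not the actual quick-start text):

```liquid
{% comment %} {{ }} substitutes the variable; {site.SPARK_VERSION} renders literally. {% endcomment %}
This quick start uses Spark {{ site.SPARK_VERSION }}.
```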






[jira] [Updated] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32865:
--
Affects Version/s: 2.3.4

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.0.1
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Minor
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Resolved] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32865.
---
Resolution: Fixed

Issue resolved by pull request 29738
[https://github.com/apache/spark/pull/29738]

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Minor
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Assigned] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32865:
-

Assignee: Bowen Li

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Minor
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Updated] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32865:
--
Affects Version/s: 2.4.7

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Bowen Li
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Updated] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32865:
--
Fix Version/s: 2.4.8

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Bowen Li
>Priority: Minor
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Updated] (SPARK-24994) Add UnwrapCastInBinaryComparison optimizer to simplify literal types

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24994:
--
Summary: Add UnwrapCastInBinaryComparison optimizer to simplify literal 
types  (was: Support cast pushdown for integral types)

> Add UnwrapCastInBinaryComparison optimizer to simplify literal types
> 
>
> Key: SPARK-24994
> URL: https://issues.apache.org/jira/browse/SPARK-24994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liuxian
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> For the statement: select * from table1 where a = 100;
> the data type of `a` is `smallint`. Because the default data type of the 
> literal 100 is `int`, `a` is cast to `int` for the comparison.
> In this case, the predicate cannot be pushed down to Parquet.
> In our workloads we generally do not cast 100 to `smallint` in the SQL 
> statement itself, so we hope this case can also support push down to Parquet.
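The optimizer named in the new title can be sketched as: for a predicate like cast(col as int) = literal, check whether the literal fits the column's narrower native type; if it does, compare in the narrow type directly (pushdown-friendly), and if it cannot fit, the predicate can never match. A simplified, hypothetical Python sketch of that idea (not Spark's actual rule):

```python
# Value ranges for a few SQL integral types.
RANGES = {
    "smallint": (-(2 ** 15), 2 ** 15 - 1),
    "int": (-(2 ** 31), 2 ** 31 - 1),
    "bigint": (-(2 ** 63), 2 ** 63 - 1),
}

def unwrap_cast_eq(col_type, literal):
    """For cast(col as wider) = literal, return ("eq", literal) when the
    literal fits col's native type, so the comparison can be done in the
    narrow type, or ("always_false", None) when it cannot possibly match."""
    lo, hi = RANGES[col_type]
    if lo <= literal <= hi:
        return ("eq", literal)
    return ("always_false", None)
```

The real rule also has to handle the other comparison operators and boundary values, which the sketch omits.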






[jira] [Assigned] (SPARK-24994) Support cast pushdown for integral types

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-24994:
-

Assignee: Chao Sun

> Support cast pushdown for integral types
> 
>
> Key: SPARK-24994
> URL: https://issues.apache.org/jira/browse/SPARK-24994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liuxian
>Assignee: Chao Sun
>Priority: Major
>
> For the statement: select * from table1 where a = 100;
> the data type of `a` is `smallint`. Because the default data type of the 
> literal 100 is `int`, `a` is cast to `int` for the comparison.
> In this case, the predicate cannot be pushed down to Parquet.
> In our workloads we generally do not cast 100 to `smallint` in the SQL 
> statement itself, so we hope this case can also support push down to Parquet.






[jira] [Resolved] (SPARK-24994) Support cast pushdown for integral types

2020-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-24994.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29565
[https://github.com/apache/spark/pull/29565]

> Support cast pushdown for integral types
> 
>
> Key: SPARK-24994
> URL: https://issues.apache.org/jira/browse/SPARK-24994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liuxian
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> For the statement: select * from table1 where a = 100;
> the data type of `a` is `smallint`. Because the default data type of the 
> literal 100 is `int`, `a` is cast to `int` for the comparison.
> In this case, the predicate cannot be pushed down to Parquet.
> In our workloads we generally do not cast 100 to `smallint` in the SQL 
> statement itself, so we hope this case can also support push down to Parquet.






[jira] [Created] (SPARK-32866) Docker buildx now requires --push

2020-09-12 Thread Holden Karau (Jira)
Holden Karau created SPARK-32866:


 Summary: Docker buildx now requires --push
 Key: SPARK-32866
 URL: https://issues.apache.org/jira/browse/SPARK-32866
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: Holden Karau


The buildx command has been updated and now requires --push to be added to 
ensure the image is pushed. To fix this, edit `./bin/docker-image-tool.sh` and 
verify that your images are pushed with the latest docker buildx.
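A hedged sketch of the kind of invocation the updated script needs; the image tag and platform list are placeholders, not values from the script:

```shell
# Build a multi-arch image and push it in one step; without --push,
# recent buildx versions leave the result only in the build cache.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t example.invalid/spark:latest \
  --push .
```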






[jira] [Assigned] (SPARK-30090) Update REPL for 2.13

2020-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30090:


Assignee: Karol Chmist

> Update REPL for 2.13
> 
>
> Key: SPARK-30090
> URL: https://issues.apache.org/jira/browse/SPARK-30090
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Karol Chmist
>Priority: Major
>
> The Spark REPL is a modified Scala REPL. It changed significantly in 2.13. We 
> will need to at least re-hack it, and along the way, see if we can do what's 
> necessary to customize it without so many invasive changes.






[jira] [Resolved] (SPARK-30090) Update REPL for 2.13

2020-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30090.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28545
[https://github.com/apache/spark/pull/28545]

> Update REPL for 2.13
> 
>
> Key: SPARK-30090
> URL: https://issues.apache.org/jira/browse/SPARK-30090
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Karol Chmist
>Priority: Major
> Fix For: 3.1.0
>
>
> The Spark REPL is a modified Scala REPL. It changed significantly in 2.13. We 
> will need to at least re-hack it, and along the way, see if we can do what's 
> necessary to customize it without so many invasive changes.






[jira] [Assigned] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32865:


Assignee: Apache Spark

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Bowen Li
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Assigned] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32865:


Assignee: (was: Apache Spark)

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Bowen Li
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Commented] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194903#comment-17194903
 ] 

Apache Spark commented on SPARK-32865:
--

User 'bowenli86' has created a pull request for this issue:
https://github.com/apache/spark/pull/29738

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Bowen Li
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Commented] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Bowen Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194896#comment-17194896
 ] 

Bowen Li commented on SPARK-32865:
--

Can someone help assign this ticket to me? It seems that I cannot assign it to 
myself.

> python section in quickstart page doesn't display SPARK_VERSION correctly
> -
>
> Key: SPARK-32865
> URL: https://issues.apache.org/jira/browse/SPARK-32865
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Bowen Li
>Priority: Minor
> Fix For: 3.1.0, 3.0.2
>
>
> [https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]
> It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Created] (SPARK-32865) python section in quickstart page doesn't display SPARK_VERSION correctly

2020-09-12 Thread Bowen Li (Jira)
Bowen Li created SPARK-32865:


 Summary: python section in quickstart page doesn't display 
SPARK_VERSION correctly
 Key: SPARK-32865
 URL: https://issues.apache.org/jira/browse/SPARK-32865
 Project: Spark
  Issue Type: Bug
  Components: docs, Documentation
Affects Versions: 3.0.1, 3.0.0
Reporter: Bowen Li
 Fix For: 3.1.0, 3.0.2


[https://github.com/apache/spark/blame/master/docs/quick-start.md#L402]

It should be site.SPARK_VERSION rather than {site.SPARK_VERSION}






[jira] [Assigned] (SPARK-32804) run-example failed in standalone cluster mode

2020-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-32804:


Assignee: Kevin Wang

> run-example failed in standalone cluster mode
> -
>
> Key: SPARK-32804
> URL: https://issues.apache.org/jira/browse/SPARK-32804
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Examples
>Affects Versions: 2.4.0, 3.0.0
> Environment: Spark 3.0 
>Reporter: Kevin Wang
>Assignee: Kevin Wang
>Priority: Minor
> Attachments: image-2020-09-05-21-55-00-227.png
>
>
> run-example failed in standalone cluster mode (it seems something is wrong in 
> how the SparkSubmit command is built): 
>  
>   !image-2020-09-05-21-55-00-227.png!






[jira] [Resolved] (SPARK-32804) run-example failed in standalone cluster mode

2020-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32804.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29653
[https://github.com/apache/spark/pull/29653]

> run-example failed in standalone cluster mode
> -
>
> Key: SPARK-32804
> URL: https://issues.apache.org/jira/browse/SPARK-32804
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Examples
>Affects Versions: 2.4.0, 3.0.0
> Environment: Spark 3.0 
>Reporter: Kevin Wang
>Assignee: Kevin Wang
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: image-2020-09-05-21-55-00-227.png
>
>
> run-example failed in standalone cluster mode (it seems something is wrong in 
> how the SparkSubmit command is built): 
>  
>   !image-2020-09-05-21-55-00-227.png!






[jira] [Updated] (SPARK-32067) [K8S] Executor pod template config map of ongoing submission got inadvertently altered by subsequent submission

2020-09-12 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially,  
with app2 launching while app1 is still in the middle of ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because both apps use the same 
configmap (name). This causes some app1's executor pods being ramped up after 
app2 is launched to be inadvertently launched with the app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     9m46s
default    podspec-configmap      1     12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     11m43s
default    app2--driver-conf-map  1     10s
default    podspec-configmap      1     13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app, ideally the same 
way as the driver configmap:
{code:java}
NAMESPACE  NAME                     DATA  AGE
default    app1--driver-conf-map    1     11m43s
default    app1--podspec-configmap  1     13m57s
default    app2--driver-conf-map    1     10s
default    app2--podspec-configmap  1     3m{code}
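The proposed solution amounts to deriving per-app resource names so concurrent submissions never share the pod template configmap. A minimal, hypothetical Python sketch of that naming scheme (the helper and name format are illustrative, not Spark's actual code):

```python
def configmap_names(app_id):
    """Derive per-app configmap names so two submissions never collide
    on the executor pod template configmap."""
    return {
        "driver_conf": f"{app_id}--driver-conf-map",
        # Prefixed with the app id instead of the shared "podspec-configmap".
        "podspec": f"{app_id}--podspec-configmap",
    }
```

With this scheme, app2's submission writes only to its own names and can no longer overwrite app1's pod template.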

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially,  
with app2 launching while app1 is still in the middle of ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because both apps use the same 
configmap (name). This causes some app1's executor pods being ramped up after 
app2 is launched to be inadvertently launched with the app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     9m46s
default    podspec-configmap      1     12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     11m43s
default    app2--driver-conf-map  1     10s
default    podspec-configmap      1     13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE  NAME                     DATA  AGE
default    app1--driver-conf-map    1     11m43s
default    app1--podspec-configmap  1     13m57s
default    app2--driver-conf-map    1     10s
default    app2--podspec-configmap  1     3m{code}


> [K8S] Executor pod template config map of ongoing submission got 
> inadvertently altered by subsequent submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially,  with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap 

[jira] [Updated] (SPARK-32067) [K8S] Executor pod template config map of ongoing submission got inadvertently altered by subsequent submission

2020-09-12 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially,  
with app2 launching while app1 is still in the middle of ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because both apps use the same 
configmap (name). This causes some app1's executor pods being ramped up after 
app2 is launched to be inadvertently launched with the app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     9m46s
default    podspec-configmap      1     12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     11m43s
default    app2--driver-conf-map  1     10s
default    podspec-configmap      1     13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE  NAME                     DATA  AGE
default    app1--driver-conf-map    1     11m43s
default    app1--podspec-configmap  1     13m57s
default    app2--driver-conf-map    1     10s
default    app2--podspec-configmap  1     3m{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially,  
with app2 launching while app1 is still in the middle of ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     9m46s
default    podspec-configmap      1     12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     11m43s
default    app2--driver-conf-map  1     10s
default    podspec-configmap      1     13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE  NAME                     DATA  AGE
default    app1--driver-conf-map    1     11m43s
default    app1--podspec-configmap  1     13m57s
default    app2--driver-conf-map    1     10s
default    app2--podspec-configmap  1     3m{code}


> [K8S] Executor pod template config map of ongoing submission got 
> inadvertently altered by subsequent submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially,  with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the 

[jira] [Updated] (SPARK-32067) [K8S] Executor pod template config map of ongoing submission got inadvertently altered by subsequent submission

2020-09-12 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially,  
with app2 launching while app1 is still in the middle of ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     9m46s
default    podspec-configmap      1     12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE  NAME                   DATA  AGE
default    app1--driver-conf-map  1     11m43s
default    app2--driver-conf-map  1     10s
default    podspec-configmap      1     13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE  NAME                     DATA  AGE
default    app1--driver-conf-map    1     11m43s
default    app1--podspec-configmap  1     13m57s
default    app2--driver-conf-map    1     10s
default    app2--podspec-configmap  1     3m{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 end up having 
app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app1--podspec-configmap1   13m57s
default  app2--driver-conf-map  1   10s 
default  app2--podspec-configmap1   3m{code}


> [K8S] Executor pod template config map of ongoing submission got 
> inadvertently altered by subsequent submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially,  with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the 

[jira] [Updated] (SPARK-32067) [K8S] Executor pod template config map of ongoing submission got inadvertently altered by subsequent submission

2020-09-12 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Summary: [K8S] Executor pod template config map of ongoing submission got 
inadvertently altered by subsequent submission  (was: [K8S] Executor pod 
template of ongoing submission got inadvertently altered by subsequent 
submission)

> [K8S] Executor pod template config map of ongoing submission got 
> inadvertently altered by subsequent submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> the two apps are the same. This causes some app1's executor pods being ramped 
> up after app2 is launched to be inadvertently launched with the app2's pod 
> template. The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2. 
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}
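The proposed naming scheme can be sketched as follows (illustrative only: the helper name and the exact resource-prefix convention are assumptions, not Spark's actual K8s submission code):

```python
# Illustrative sketch of the proposed fix: derive the pod-template configmap
# name from the per-app resource prefix instead of using one shared name.

LEGACY_NAME = "podspec-configmap"  # shared today, so concurrent apps collide

def podspec_configmap_name(resource_prefix: str) -> str:
    """Prefix the configmap name so each submission owns its own configmap."""
    return f"{resource_prefix}-podspec-configmap"

# With per-app prefixes (as in the listing above), the names no longer collide:
app1 = podspec_configmap_name("app1-")  # "app1--podspec-configmap"
app2 = podspec_configmap_name("app2-")  # "app2--podspec-configmap"
assert app1 != app2
```

Since configmap names must be unique per namespace, giving each submission its own prefixed name means a later submission can no longer overwrite an earlier one's pod template.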



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32864) Support ORC forced positional evolution

2020-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32864:


Assignee: Apache Spark

> Support ORC forced positional evolution
> ---
>
> Key: SPARK-32864
> URL: https://issues.apache.org/jira/browse/SPARK-32864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Minor
>
> Hive respects the "orc.force.positional.evolution" config; Spark should honor it 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32864) Support ORC forced positional evolution

2020-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194763#comment-17194763
 ] 

Apache Spark commented on SPARK-32864:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29737

> Support ORC forced positional evolution
> ---
>
> Key: SPARK-32864
> URL: https://issues.apache.org/jira/browse/SPARK-32864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Peter Toth
>Priority: Minor
>
> Hive respects the "orc.force.positional.evolution" config; Spark should honor it 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32864) Support ORC forced positional evolution

2020-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32864:


Assignee: (was: Apache Spark)

> Support ORC forced positional evolution
> ---
>
> Key: SPARK-32864
> URL: https://issues.apache.org/jira/browse/SPARK-32864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Peter Toth
>Priority: Minor
>
> Hive respects the "orc.force.positional.evolution" config; Spark should honor it 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32864) Support ORC forced positional evolution

2020-09-12 Thread Peter Toth (Jira)
Peter Toth created SPARK-32864:
--

 Summary: Support ORC forced positional evolution
 Key: SPARK-32864
 URL: https://issues.apache.org/jira/browse/SPARK-32864
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Peter Toth


Hive respects the "orc.force.positional.evolution" config; Spark should honor it as 
well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-09-12 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang resolved SPARK-32542.
---
Resolution: Fixed

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split a single Expand into several smaller Expands, each containing the 
> specified number of projections.
> For instance, consider this SQL: select a, b, c, d, count(1) from table1 group 
> by a, b, c, d with cube. The Expand can blow the data up to 2^4 times its 
> original size.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 = 4 smaller Expands. This reduces shuffle pressure and improves 
> performance in multidimensional analysis when the data is huge.
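The projection arithmetic in the description can be checked with a small sketch (illustrative only; `projections_per_expand` stands in for the proposed `spark.sql.optimizer.projections.size` setting and the helpers are not an existing Spark API):

```python
import math

def cube_projections(n_columns: int) -> int:
    """A CUBE over n grouping columns expands each input row into 2^n
    projections (one per grouping set) inside a single Expand operator."""
    return 2 ** n_columns

def num_split_expands(n_columns: int, projections_per_expand: int) -> int:
    """Number of smaller Expands after the proposed split, each holding at
    most `projections_per_expand` projections."""
    return math.ceil(cube_projections(n_columns) / projections_per_expand)

# The example from the description: GROUP BY a, b, c, d WITH CUBE
assert cube_projections(4) == 16     # 2^4x blow-up of the input size
assert num_split_expands(4, 4) == 4  # split into 2^4 / 4 smaller Expands
```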



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32863) Full outer stream-stream join

2020-09-12 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194646#comment-17194646
 ] 

Cheng Su commented on SPARK-32863:
--

Will raise a PR in the next couple of weeks. Thanks.

> Full outer stream-stream join
> -
>
> Key: SPARK-32863
> URL: https://issues.apache.org/jira/browse/SPARK-32863
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Current stream-stream join supports inner, left outer and right outer join 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). With the current design of stream-stream join (which marks in the state 
> store whether a row has been matched), it would be straightforward to support 
> full outer join as well.
>  
> Full outer stream-stream join will work as follows:
> (1) For a left-side input row, check whether there is a match in the right-side 
> state store. If there is, output all matched rows. Put the row in the left-side 
> state store.
> (2) For a right-side input row, check whether there is a match in the left-side 
> state store. If there is, output all matched rows and set the "matched" field 
> of the matched left-side rows to true. Put the right-side row in the right-side 
> state store.
> (3) When a left-side row needs to be evicted from the state store, output it if 
> its "matched" field is false.
> (4) When a right-side row needs to be evicted from the state store, output it 
> if its "matched" field is false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32863) Full outer stream-stream join

2020-09-12 Thread Cheng Su (Jira)
Cheng Su created SPARK-32863:


 Summary: Full outer stream-stream join
 Key: SPARK-32863
 URL: https://issues.apache.org/jira/browse/SPARK-32863
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Cheng Su


Current stream-stream join supports inner, left outer and right outer join 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
 ). With the current design of stream-stream join (which marks in the state store 
whether a row has been matched), it would be straightforward to support full outer 
join as well.

Full outer stream-stream join will work as follows:

(1) For a left-side input row, check whether there is a match in the right-side 
state store. If there is, output all matched rows. Put the row in the left-side 
state store.

(2) For a right-side input row, check whether there is a match in the left-side 
state store. If there is, output all matched rows and set the "matched" field of 
the matched left-side rows to true. Put the right-side row in the right-side 
state store.

(3) When a left-side row needs to be evicted from the state store, output it if 
its "matched" field is false.

(4) When a right-side row needs to be evicted from the state store, output it if 
its "matched" field is false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32838) Cannot overwrite different partition with same table

2020-09-12 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194644#comment-17194644
 ] 

CHC edited comment on SPARK-32838 at 9/12/20, 7:47 AM:
---

After spending a long time exploring, I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 converts HiveTableRelation to LogicalRelation, and this then matches the 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 case condition.

(In Spark 2.4.3, HiveTableRelation is not converted to LogicalRelation if the 
table is partitioned; note that if the table is not partitioned, an insert 
overwrite into itself also fails with this error.)

This works when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is a bug, because this scenario is a normal use case.


was (Author: chenxchen):
After spending a long time exploring, I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 converts HiveTableRelation to LogicalRelation, and this then matches the 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 case condition.

(In Spark 2.4.3, HiveTableRelation is not converted to LogicalRelation if the 
table is partitioned; note that if the table is not partitioned, an insert 
overwrite into itself also fails with this error.)

This works when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is a bug, because this scenario is a normal use case.

> Cannot overwrite different partition with same table
> ---
>
> Key: SPARK-32838
> URL: https://issues.apache.org/jira/browse/SPARK-32838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: hadoop 2.7 + spark 3.0.0
>Reporter: CHC
>Priority: Major
>
> When:
> {code:java}
> CREATE TABLE tmp.spark3_snap (
> id string
> )
> PARTITIONED BY (dt string)
> STORED AS ORC
> ;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-09')
> select 10;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select 1;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select id from tmp.spark3_snap where dt='2020-09-09';
> {code}
> and it fails with the error "Cannot overwrite a path that is also being read 
> from".
> Related: https://issues.apache.org/jira/browse/SPARK-24194
> This works on Spark 2.4.3 but does not work on Spark 3.0.0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32838) Cannot overwrite different partition with same table

2020-09-12 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194644#comment-17194644
 ] 

CHC commented on SPARK-32838:
-

After spending a long time exploring, I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 converts HiveTableRelation to LogicalRelation, and this then matches the 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 case condition.

(In Spark 2.4.3, HiveTableRelation is not converted to LogicalRelation if the 
table is partitioned; note that if the table is not partitioned, an insert 
overwrite into itself also fails with this error.)

This works when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is a bug, because this scenario is a normal use case.

> Cannot overwrite different partition with same table
> ---
>
> Key: SPARK-32838
> URL: https://issues.apache.org/jira/browse/SPARK-32838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: hadoop 2.7 + spark 3.0.0
>Reporter: CHC
>Priority: Major
>
> When:
> {code:java}
> CREATE TABLE tmp.spark3_snap (
> id string
> )
> PARTITIONED BY (dt string)
> STORED AS ORC
> ;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-09')
> select 10;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select 1;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select id from tmp.spark3_snap where dt='2020-09-09';
> {code}
> and it fails with the error "Cannot overwrite a path that is also being read 
> from".
> Related: https://issues.apache.org/jira/browse/SPARK-24194
> This works on Spark 2.4.3 but does not work on Spark 3.0.0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32838) Cannot overwrite different partition with same table

2020-09-12 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194644#comment-17194644
 ] 

CHC edited comment on SPARK-32838 at 9/12/20, 7:47 AM:
---

After spending a long time exploring, I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 converts HiveTableRelation to LogicalRelation, and this then matches the 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 case condition.

(In Spark 2.4.3, HiveTableRelation is not converted to LogicalRelation if the 
table is partitioned; note that if the table is not partitioned, an insert 
overwrite into itself also fails with this error.)

This works when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is a bug, because this scenario is a normal use case.


was (Author: chenxchen):
After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 will convert HiveTableRelation to LogicalRelation,

and this will match this case 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 condition

(at spark 2.4.3, HiveTableRelation will not to convert to LogicalRelation if 
table is partitioned,

so if table is none partitioned, insert overwirte itselft also will be get an 
error)

This is ok when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is bug cause this scene is normal demand.

 

 

> Cannot overwrite different partition with same table
> ---
>
> Key: SPARK-32838
> URL: https://issues.apache.org/jira/browse/SPARK-32838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: hadoop 2.7 + spark 3.0.0
>Reporter: CHC
>Priority: Major
>
> When:
> {code:java}
> CREATE TABLE tmp.spark3_snap (
> id string
> )
> PARTITIONED BY (dt string)
> STORED AS ORC
> ;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-09')
> select 10;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select 1;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select id from tmp.spark3_snap where dt='2020-09-09';
> {code}
> and it fails with the error "Cannot overwrite a path that is also being read 
> from".
> Related: https://issues.apache.org/jira/browse/SPARK-24194
> This works on Spark 2.4.3 but does not work on Spark 3.0.0.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32862) Left semi stream-stream join

2020-09-12 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194643#comment-17194643
 ] 

Cheng Su commented on SPARK-32862:
--

Will raise a PR in the next couple of weeks. Thanks.

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Current stream-stream join supports inner, left outer and right outer join 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). Internally we see a lot of users using left semi stream-stream joins 
> (outside Spark Structured Streaming), e.g. "give me the ad impressions (join 
> left side) that have a click (join right side), but I don't care how many 
> clicks each ad got" (left semi semantics).
>  
> Left semi stream-stream join will work as follows:
> (1) For a left-side input row, check whether there is a match in the right-side 
> state store.
>   (1.1) If there is a match, output the left-side row.
>   (1.2) If there is no match, put the row in the left-side state store (with 
> its "matched" field set to false).
> (2) For a right-side input row, check whether there is a match in the left-side 
> state store. If there is, set the "matched" field of the matched left-side rows 
> to true. Put the right-side row in the right-side state store.
> (3) When a left-side row needs to be evicted from the state store, output it if 
> its "matched" field is true.
> (4) When a right-side row needs to be evicted from the state store, do nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32862) Left semi stream-stream join

2020-09-12 Thread Cheng Su (Jira)
Cheng Su created SPARK-32862:


 Summary: Left semi stream-stream join
 Key: SPARK-32862
 URL: https://issues.apache.org/jira/browse/SPARK-32862
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Cheng Su


Current stream-stream join supports inner, left outer and right outer join 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
 ). Internally we see a lot of users using left semi stream-stream joins (outside 
Spark Structured Streaming), e.g. "give me the ad impressions (join left side) 
that have a click (join right side), but I don't care how many clicks each ad 
got" (left semi semantics).

Left semi stream-stream join will work as follows:

(1) For a left-side input row, check whether there is a match in the right-side 
state store.

  (1.1) If there is a match, output the left-side row.

  (1.2) If there is no match, put the row in the left-side state store (with its 
"matched" field set to false).

(2) For a right-side input row, check whether there is a match in the left-side 
state store. If there is, set the "matched" field of the matched left-side rows 
to true. Put the right-side row in the right-side state store.

(3) When a left-side row needs to be evicted from the state store, output it if 
its "matched" field is true.

(4) When a right-side row needs to be evicted from the state store, do nothing.
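The four steps above can be sketched as a toy in-memory model (a hypothetical simplification: the real implementation would sit in StreamingSymmetricHashJoinExec on Spark's state store with watermark-driven eviction; here state is a plain dict and eviction is explicit):

```python
from collections import defaultdict

class LeftSemiJoinSim:
    """Toy model of the proposed left semi stream-stream join."""

    def __init__(self):
        self.left_state = defaultdict(list)   # key -> [[value, matched], ...]
        self.right_state = defaultdict(list)  # key -> [value, ...]
        self.output = []                      # emitted left-side values

    def on_left(self, key, value):
        # Step (1): probe the right-side state.
        if self.right_state.get(key):
            self.output.append(value)         # (1.1) match: emit once, no buffering
        else:
            self.left_state[key].append([value, False])  # (1.2) buffer unmatched

    def on_right(self, key, value):
        # Step (2): mark buffered left rows matched; buffer the right row.
        for l in self.left_state.get(key, []):
            l[1] = True
        self.right_state[key].append(value)

    def evict_left(self, key):
        # Step (3): emit evicted left rows only if a right row matched them later.
        for value, matched in self.left_state.pop(key, []):
            if matched:
                self.output.append(value)

    def evict_right(self, key):
        # Step (4): evicting right-side rows emits nothing.
        self.right_state.pop(key, None)
```

Note the semi-join property: each left row is emitted at most once, regardless of how many right rows share its key.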



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24528) Add support to read multiple sorted bucket files for data source v1

2020-09-12 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194505#comment-17194505
 ] 

Cheng Su edited comment on SPARK-24528 at 9/12/20, 7:28 AM:


After discussing with [~cloud_fan], it would be better to have a rule that 
automatically decides whether to do a sorted bucket scan based on the query 
shape, and the same for bucket scan. So I will first do bucket scan in 
https://issues.apache.org/jira/browse/SPARK-32859 and redo this one after the 
PR for https://issues.apache.org/jira/browse/SPARK-32859 is merged.


was (Author: chengsu):
After discussing with [~cloud_fan] , it would be better to have a rule to 
automatically decide whether to do bucket sorted scan based on query shape, and 
same for bucket scan. So I will first do bucket scan in 
https://issues.apache.org/jira/browse/SPARK-24528, and redo this one after the 
PR for https://issues.apache.org/jira/browse/SPARK-24528 is merged.

> Add support to read multiple sorted bucket files for data source v1
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> waste time on the Sort operator even though the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So the code currently avoids dealing with this situation. Could you think of 
> a way to solve this or bypass it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org