[jira] [Updated] (SPARK-36255) FileNotFound exceptions from the shuffle push can cause the executor to terminate

2021-07-21 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-36255:
--
Summary: FileNotFound exceptions from the shuffle push can cause the 
executor to terminate  (was: FileNotFound exceptions in the Shuffle-push-thread 
can cause the executor to fail)

> FileNotFound exceptions from the shuffle push can cause the executor to 
> terminate
> -
>
> Key: SPARK-36255
> URL: https://issues.apache.org/jira/browse/SPARK-36255
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> When the shuffle files are cleaned up by the executors once a job in a Spark 
> application completes, the push of the shuffle data by the executor can throw 
> a FileNotFoundException. When this exception is thrown from the 
> {{shuffle-block-push-thread}}, it causes the executor to fail. This is 
> because of the default uncaught exception handler for Spark daemon threads, 
> which terminates the executor whenever a daemon thread throws an uncaught 
> exception.
> {code:java}
> 21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[block-push-thread-1,5,main]
> java.lang.Error: java.io.IOException: Error in opening 
> FileSegmentManagedBuffer
> {file=/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
>  offset=10640, length=190}
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Error in opening 
> FileSegmentManagedBuffer{file=***/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
>  offset=10640, length=190}
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
> at 
> org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
> at 
> org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
> at 
> org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
> at 
> org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ... 2 more
> Caused by: java.io.FileNotFoundException: 
> **/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data
>  (No such file or directory)
> at java.io.RandomAccessFile.open0(Native Method)
> at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
> {code}
> We can address the issue by handling {{FileNotFound}} exceptions in the push 
> threads and Netty threads, stopping the push when {{FileNotFound}} is 
> encountered.
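
A minimal sketch of the proposed handling, with hypothetical names 
({{PushErrorHandling}}, {{stopPushing}}, and {{pushTask}} are illustrative, 
not the actual {{ShuffleWriter}} code): catch {{FileNotFoundException}} inside 
the push task itself so it never reaches the daemon thread's uncaught 
exception handler, which would terminate the executor.
{code:java}
import java.io.FileNotFoundException
import java.util.concurrent.atomic.AtomicBoolean

object PushErrorHandling {
  // Once the shuffle files are gone, all further pushes are pointless.
  private val stopPushing = new AtomicBoolean(false)

  // Wrap a push so that a FileNotFoundException stops further pushes
  // instead of escaping as an uncaught exception on the daemon thread.
  def pushTask(push: () => Unit): Runnable = new Runnable {
    override def run(): Unit = {
      if (!stopPushing.get()) {
        try push()
        catch {
          case _: FileNotFoundException => stopPushing.set(true)
        }
      }
    }
  }
}
{code}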






[jira] [Updated] (SPARK-36255) FileNotFound exceptions in the Shuffle-push-thread can cause the executor to fail

2021-07-21 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-36255:
--
Description: 
When the shuffle files are cleaned up by the executors once a job in a Spark 
application completes, the push of the shuffle data by the executor can throw 
a FileNotFoundException. When this exception is thrown from the 
{{shuffle-block-push-thread}}, it causes the executor to fail. This is because 
of the default uncaught exception handler for Spark daemon threads, which 
terminates the executor whenever a daemon thread throws an uncaught exception.
{code:java}
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception 
in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer

{file=/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
 offset=10640, length=190}
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening 
FileSegmentManagedBuffer{file=***/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
 offset=10640, length=190}

at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
at 
org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
at 
org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
at 
org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
at 
org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
Caused by: java.io.FileNotFoundException: 
**/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data
 (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
{code}
We can address the issue by handling {{FileNotFound}} exceptions in the push 
threads and Netty threads, stopping the push when {{FileNotFound}} is 
encountered.

  was:
When the shuffle files are cleaned up by the executors once a job in a Spark 
application completes, the push of the shuffle data by the executor can throw 
a FileNotFoundException. When this exception is thrown from the 
{{shuffle-block-push-thread}}, it causes the executor to fail. This is because 
of the default uncaught exception handler for Spark daemon threads, which 
terminates the executor when a daemon thread throws an exception.
{code:java}
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception 
in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer

{file=/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
 offset=10640, length=190}
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening 
FileSegmentManagedBuffer{file=***/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
 offset=10640, length=190}

at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
at 
org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
at 
org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
at 
org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
at 
org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
Caused by: java.io.FileNotFoundException: 
**/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data
 (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
{code}
We can address the issue by handling {{FileNotFound}} exceptions in the push 
threads and Netty threads, stopping the push when {{FileNotFound}} is 
encountered.

[jira] [Updated] (SPARK-36255) FileNotFound exceptions in the Shuffle-push-thread can cause the executor to fail

2021-07-21 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-36255:
--
Description: 
When the shuffle files are cleaned up by the executors once a job in a Spark 
application completes, the push of the shuffle data by the executor can throw 
a FileNotFoundException. When this exception is thrown from the 
{{shuffle-block-push-thread}}, it causes the executor to fail. This is because 
of the default uncaught exception handler for Spark daemon threads, which 
terminates the executor when a daemon thread throws an exception.
{code:java}
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception 
in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer

{file=/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
 offset=10640, length=190}
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening 
FileSegmentManagedBuffer{file=***/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
 offset=10640, length=190}

at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
at 
org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
at 
org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
at 
org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
at 
org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
Caused by: java.io.FileNotFoundException: 
**/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data
 (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
{code}
We can address the issue by handling {{FileNotFound}} exceptions in the push 
threads and Netty threads, stopping the push when {{FileNotFound}} is 
encountered.

  was:When the shuffle files are cleaned up by the executors once a job 
completes, the push of the shuffle data will throw FileNotFound exceptions. 
This exception, when thrown from the {{shuffle-block-push-thread}}, still 
causes the executor to fail. 


> FileNotFound exceptions in the Shuffle-push-thread can cause the executor to 
> fail
> -
>
> Key: SPARK-36255
> URL: https://issues.apache.org/jira/browse/SPARK-36255
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> When the shuffle files are cleaned up by the executors once a job in a Spark 
> application completes, the push of the shuffle data by the executor can throw 
> a FileNotFoundException. When this exception is thrown from the 
> {{shuffle-block-push-thread}}, it causes the executor to fail. This is 
> because of the default uncaught exception handler for Spark daemon threads, 
> which terminates the executor when a daemon thread throws an exception.
> {code:java}
> 21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[block-push-thread-1,5,main]
> java.lang.Error: java.io.IOException: Error in opening 
> FileSegmentManagedBuffer
> {file=/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
>  offset=10640, length=190}
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Error in opening 
> FileSegmentManagedBuffer{file=***/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
>  offset=10640, length=190}
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
> at 
> org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
> at 
> org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
> at 
> org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
> at 
> org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ... 2 more
> Caused by: java.io.FileNotFoundException: 
> **/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data
>  (No such file or directory)
> at java.io.RandomAccessFile.open0(Native Method)
> at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
> {code}
> We can address the issue by handling {{FileNotFound}} exceptions in the push 
> threads and Netty threads, stopping the push when {{FileNotFound}} is 
> encountered.

[jira] [Resolved] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-21 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36214.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 33470
https://github.com/apache/spark/pull/33470

> Add add_categories to CategoricalAccessor and CategoricalIndex.
> ---
>
> Key: SPARK-36214
> URL: https://issues.apache.org/jira/browse/SPARK-36214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Updated] (SPARK-36255) FileNotFound exceptions in the Shuffle-push-thread can cause the executor to fail

2021-07-21 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-36255:
--
Parent: SPARK-30602
Issue Type: Sub-task  (was: Bug)

> FileNotFound exceptions in the Shuffle-push-thread can cause the executor to 
> fail
> -
>
> Key: SPARK-36255
> URL: https://issues.apache.org/jira/browse/SPARK-36255
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> When the shuffle files are cleaned up by the executors once a job completes, 
> the push of the shuffle data will throw FileNotFound exceptions. This 
> exception, when thrown from the {{shuffle-block-push-thread}}, still causes 
> the executor to fail. 






[jira] [Created] (SPARK-36255) FileNotFound exceptions in the Shuffle-push-thread can cause the executor to fail

2021-07-21 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-36255:
-

 Summary: FileNotFound exceptions in the Shuffle-push-thread can 
cause the executor to fail
 Key: SPARK-36255
 URL: https://issues.apache.org/jira/browse/SPARK-36255
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.1.0
Reporter: Chandni Singh


When the shuffle files are cleaned up by the executors once a job completes, 
the push of the shuffle data will throw FileNotFound exceptions. This 
exception, when thrown from the {{shuffle-block-push-thread}}, still causes 
the executor to fail. 






[jira] [Updated] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36253:
-
Fix Version/s: 3.2.0

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.2.0
>
>
> We should document when support for the pandas API on Spark was added.






[jira] [Resolved] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36253.
--
Resolution: Fixed

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.2.0
>
>
> We should document when support for the pandas API on Spark was added.






[jira] [Commented] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385263#comment-17385263
 ] 

Hyukjin Kwon commented on SPARK-36253:
--

Fixed in https://github.com/apache/spark/pull/33473

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should document when support for the pandas API on Spark was added.






[jira] [Created] (SPARK-36254) Install mlflow and delta in Github Actions CI

2021-07-21 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-36254:
---

 Summary: Install mlflow and delta in Github Actions CI
 Key: SPARK-36254
 URL: https://issues.apache.org/jira/browse/SPARK-36254
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Haejoon Lee


Since pandas-on-Spark includes mlflow and delta features and related tests, we 
should install mlflow and delta in our GitHub Actions CI so that the tests 
won't be skipped from Spark 3.2 onwards.

We should also add logic to check the Spark version, so that mlflow and delta 
are installed only for Spark 3.2 and above.






[jira] [Commented] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385245#comment-17385245
 ] 

Apache Spark commented on SPARK-36253:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33473

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should document when support for the pandas API on Spark was added.






[jira] [Assigned] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36253:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should document when support for the pandas API on Spark was added.






[jira] [Assigned] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36253:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> We should document when support for the pandas API on Spark was added.






[jira] [Commented] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385243#comment-17385243
 ] 

Apache Spark commented on SPARK-36253:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33473

> Document added version of pandas-on-Spark support
> -
>
> Key: SPARK-36253
> URL: https://issues.apache.org/jira/browse/SPARK-36253
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should document when support for the pandas API on Spark was added.






[jira] [Created] (SPARK-36253) Document added version of pandas-on-Spark support

2021-07-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36253:


 Summary: Document added version of pandas-on-Spark support
 Key: SPARK-36253
 URL: https://issues.apache.org/jira/browse/SPARK-36253
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


We should document when support for the pandas API on Spark was added.






[jira] [Updated] (SPARK-36252) Add log files rolling policy for driver running in cluster mode with spark standalone cluster

2021-07-21 Thread Jack Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Hu updated SPARK-36252:

Description: 
For a long-running driver in cluster mode, there is no rolling policy, so the 
logs (stdout/stderr) may occupy a lot of space; the user needs an external 
tool to clean up old logs, which is not user friendly. 

For executors, the following 5 configurations are used to control the log file 
rolling policy:
{code:java}
spark.executor.logs.rolling.maxRetainedFiles
spark.executor.logs.rolling.enableCompression
spark.executor.logs.rolling.maxSize
spark.executor.logs.rolling.strategy
spark.executor.logs.rolling.time.interval
{code}

For a driver running in cluster mode, there are two options:
1. reuse the executor settings
2. similar to the executor: add the following configurations (these would only 
apply to stderr/stdout of a driver in cluster mode)
{code:java}
spark.driver.logs.rolling.maxRetainedFiles
spark.driver.logs.rolling.enableCompression
spark.driver.logs.rolling.maxSize
spark.driver.logs.rolling.strategy
spark.driver.logs.rolling.time.interval
{code}

#2 seems better, do you agree?

  was:
For a long-running driver in cluster mode, there is no rolling policy, so the 
logs (stdout/stderr) may occupy a lot of space; the user needs an external 
tool to clean up old logs, which is not user friendly. 

For executors, the following 5 configurations are used to control the log file 
rolling policy:
{code:java}
spark.executor.logs.rolling.maxRetainedFiles
spark.executor.logs.rolling.enableCompression
spark.executor.logs.rolling.maxSize
spark.executor.logs.rolling.strategy
spark.executor.logs.rolling.time.interval
{code}

For a driver running in cluster mode, there are two options:
1. reuse the executor settings
2. similar to the executor: add the following configurations (these would only 
apply to stderr/stdout of a driver in cluster mode)
{code:java}
spark.driver.logs.rolling.maxRetainedFiles
spark.driver.logs.rolling.enableCompression
spark.driver.logs.rolling.maxSize
spark.driver.logs.rolling.strategy
spark.driver.logs.rolling.time.interval
{code}

#2 seems better, do you agree?


> Add log files rolling policy for driver running in cluster mode with spark 
> standalone cluster
> -
>
> Key: SPARK-36252
> URL: https://issues.apache.org/jira/browse/SPARK-36252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Jack Hu
>Priority: Major
>
> For a long-running driver in cluster mode, there is no rolling policy, so 
> the logs (stdout/stderr) may occupy a lot of space; the user needs an 
> external tool to clean up old logs, which is not user friendly. 
> For executors, the following 5 configurations are used to control the log 
> file rolling policy:
> {code:java}
> spark.executor.logs.rolling.maxRetainedFiles
> spark.executor.logs.rolling.enableCompression
> spark.executor.logs.rolling.maxSize
> spark.executor.logs.rolling.strategy
> spark.executor.logs.rolling.time.interval
> {code}
> For a driver running in cluster mode, there are two options:
> 1. reuse the executor settings
> 2. similar to the executor: add the following configurations (these would 
> only apply to stderr/stdout of a driver in cluster mode)
> {code:java}
> spark.driver.logs.rolling.maxRetainedFiles
> spark.driver.logs.rolling.enableCompression
> spark.driver.logs.rolling.maxSize
> spark.driver.logs.rolling.strategy
> spark.driver.logs.rolling.time.interval
> {code}
> #2 seems better, do you agree?






[jira] [Created] (SPARK-36252) Add log files rolling policy for driver running in cluster mode with spark standalone cluster

2021-07-21 Thread Jack Hu (Jira)
Jack Hu created SPARK-36252:
---

 Summary: Add log files rolling policy for driver running in 
cluster mode with spark standalone cluster
 Key: SPARK-36252
 URL: https://issues.apache.org/jira/browse/SPARK-36252
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Jack Hu


For a long-running driver in cluster mode, there is no rolling policy, so the 
logs (stdout/stderr) may occupy a lot of space; the user needs an external 
tool to clean up old logs, which is not user friendly. 

For executors, the following 5 configurations are used to control the log file 
rolling policy:
{code:java}
spark.executor.logs.rolling.maxRetainedFiles
spark.executor.logs.rolling.enableCompression
spark.executor.logs.rolling.maxSize
spark.executor.logs.rolling.strategy
spark.executor.logs.rolling.time.interval
{code}

For a driver running in cluster mode, there are two options:
1. reuse the executor settings
2. similar to the executor: add the following configurations (these would only 
apply to stderr/stdout of a driver in cluster mode)
{code:java}
spark.driver.logs.rolling.maxRetainedFiles
spark.driver.logs.rolling.enableCompression
spark.driver.logs.rolling.maxSize
spark.driver.logs.rolling.strategy
spark.driver.logs.rolling.time.interval
{code}

#2 seems better, do you agree?
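
For reference, a sketch of how the existing executor-side settings are used, 
with illustrative values (option #2 would mirror these under 
{{spark.driver.logs.rolling.*}}):
{code:java}
# e.g. in conf/spark-defaults.conf (values are illustrative)
spark.executor.logs.rolling.strategy          time
spark.executor.logs.rolling.time.interval     daily
spark.executor.logs.rolling.maxRetainedFiles  7
spark.executor.logs.rolling.enableCompression true
{code}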






[jira] [Resolved] (SPARK-36063) Optimize OneRowRelation subqueries

2021-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36063.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33284
[https://github.com/apache/spark/pull/33284]

> Optimize OneRowRelation subqueries
> --
>
> Key: SPARK-36063
> URL: https://issues.apache.org/jira/browse/SPARK-36063
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Inline subqueries with OneRowRelation as leaf nodes instead of decorrelating 
> and rewriting them as left outer joins.
> Scalar subquery:
>  ```
>  SELECT (SELECT c1) FROM t1 -> SELECT c1 FROM t1
>  ```
> Lateral subquery:
>  ```
>  SELECT * FROM t1, LATERAL (SELECT c1, c2) -> SELECT c1, c2 , c1, c2 FROM t1
>  ```
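
A quick way to observe the effect (a sketch; assumes a table {{t1}} with a 
column {{c1}} is registered, e.g. in spark-shell):
{code:java}
// With the optimization, the scalar subquery over OneRowRelation should be
// inlined into the projection instead of being rewritten to a left outer
// join, so no Join node is expected in the optimized plan.
spark.sql("SELECT (SELECT c1) FROM t1").explain(true)
{code}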






[jira] [Assigned] (SPARK-36063) Optimize OneRowRelation subqueries

2021-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36063:
---

Assignee: Allison Wang

> Optimize OneRowRelation subqueries
> --
>
> Key: SPARK-36063
> URL: https://issues.apache.org/jira/browse/SPARK-36063
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Inline subqueries with OneRowRelation as leaf nodes instead of decorrelating 
> and rewriting them as left outer joins.
> Scalar subquery:
>  ```
>  SELECT (SELECT c1) FROM t1 -> SELECT c1 FROM t1
>  ```
> Lateral subquery:
>  ```
>  SELECT * FROM t1, LATERAL (SELECT c1, c2) -> SELECT c1, c2 , c1, c2 FROM t1
>  ```






[jira] [Resolved] (SPARK-36244) Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

2021-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36244.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33464
[https://github.com/apache/spark/pull/33464]

> Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
> 
>
> Key: SPARK-36244
> URL: https://issues.apache.org/jira/browse/SPARK-36244
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> zstd-jni 1.5.0-3 was released a few days ago.
> This release resolves an issue in buffer size calculation, which can 
> affect usage in Spark.
> https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3






[jira] [Resolved] (SPARK-35912) [SQL] JSON read behavior is different depending on the cache setting when nullable is false.

2021-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35912.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33436
[https://github.com/apache/spark/pull/33436]

> [SQL] JSON read behavior is different depending on the cache setting when 
> nullable is false.
> 
>
> Key: SPARK-35912
> URL: https://issues.apache.org/jira/browse/SPARK-35912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Heedo Lee
>Assignee: Fu Chen
>Priority: Minor
> Fix For: 3.3.0
>
>
> Below is code that reproduces the issue.
>  
> {code:java}
> import org.apache.spark.sql.Encoders
>  
> case class TestSchema(x: Int, y: Int)
> case class BaseSchema(value: TestSchema)
>  
> val schema = Encoders.product[BaseSchema].schema
> val testDS = Seq("""{"value":{"x":1}}""", """{"value":{"x":2}}""").toDS
> val jsonDS = spark.read.schema(schema).json(testDS)
> jsonDS.show
> +---------+
> |    value|
> +---------+
> |{1, null}|
> |{2, null}|
> +---------+
> jsonDS.cache.show
> +------+
> | value|
> +------+
> |{1, 0}|
> |{2, 0}|
> +------+
> {code}
>  
> The above result occurs when a schema is created with a nested StructType 
> whose StructFields have {{nullable}} set to false.
>  
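
The nullability that triggers this comes from the encoder-derived schema; a 
quick check (a sketch; the commented output is what one would expect for the 
case classes above):
{code:java}
import org.apache.spark.sql.Encoders

// Int fields of a case class are primitive and therefore non-nullable in
// the derived schema, which is why the cached and uncached reads disagree.
Encoders.product[BaseSchema].schema.printTreeString()
// root
//  |-- value: struct (nullable = true)
//  |    |-- x: integer (nullable = false)
//  |    |-- y: integer (nullable = false)
{code}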






[jira] [Assigned] (SPARK-35912) [SQL] JSON read behavior is different depending on the cache setting when nullable is false.

2021-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35912:


Assignee: Fu Chen

> [SQL] JSON read behavior is different depending on the cache setting when 
> nullable is false.
> 
>
> Key: SPARK-35912
> URL: https://issues.apache.org/jira/browse/SPARK-35912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Heedo Lee
>Assignee: Fu Chen
>Priority: Minor
>
> Below is code that reproduces the issue.
>  
> {code:java}
> import org.apache.spark.sql.Encoders
>  
> case class TestSchema(x: Int, y: Int)
> case class BaseSchema(value: TestSchema)
>  
> val schema = Encoders.product[BaseSchema].schema
> val testDS = Seq("""{"value":{"x":1}}""", """{"value":{"x":2}}""").toDS
> val jsonDS = spark.read.schema(schema).json(testDS)
> jsonDS.show
> +---------+
> |    value|
> +---------+
> |{1, null}|
> |{2, null}|
> +---------+
> jsonDS.cache.show
> +------+
> | value|
> +------+
> |{1, 0}|
> |{2, 0}|
> +------+
> {code}
>  
> The above result occurs when a schema is created with a nested StructType 
> whose StructFields have {{nullable}} set to false.
>  






[jira] [Commented] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385176#comment-17385176
 ] 

Hyukjin Kwon commented on SPARK-32666:
--

Thanks [~shaneknapp]!!!

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 3.3.0
>
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Commented] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385174#comment-17385174
 ] 

Apache Spark commented on SPARK-36251:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33472

> Cover GitHub Actions runs without SHA in testing script
> ---
>
> Key: SPARK-36251
> URL: https://issues.apache.org/jira/browse/SPARK-36251
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36204 added the periodic jobs for branch-3.2 too, but those jobs run 
> without a SHA being set.
> The test script should be able to handle this case.






[jira] [Assigned] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36251:


Assignee: (was: Apache Spark)

> Cover GitHub Actions runs without SHA in testing script
> ---
>
> Key: SPARK-36251
> URL: https://issues.apache.org/jira/browse/SPARK-36251
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36204 added the periodic jobs for branch-3.2 too, but those jobs run 
> without a SHA being set.
> The test script should be able to handle this case.






[jira] [Commented] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385173#comment-17385173
 ] 

Apache Spark commented on SPARK-36251:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33472

> Cover GitHub Actions runs without SHA in testing script
> ---
>
> Key: SPARK-36251
> URL: https://issues.apache.org/jira/browse/SPARK-36251
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36204 added the periodic jobs for branch-3.2 too, but those jobs run 
> without a SHA being set.
> The test script should be able to handle this case.






[jira] [Assigned] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36251:


Assignee: Apache Spark

> Cover GitHub Actions runs without SHA in testing script
> ---
>
> Key: SPARK-36251
> URL: https://issues.apache.org/jira/browse/SPARK-36251
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-36204 added the periodic jobs for branch-3.2 too, but those jobs run 
> without a SHA being set.
> The test script should be able to handle this case.






[jira] [Created] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-07-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36251:


 Summary: Cover GitHub Actions runs without SHA in testing script
 Key: SPARK-36251
 URL: https://issues.apache.org/jira/browse/SPARK-36251
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


SPARK-36204 added the periodic jobs for branch-3.2 too, but those jobs run 
without a SHA being set.

The test script should be able to handle this case.






[jira] [Created] (SPARK-36250) Add support for running make-distribution without a "clean"

2021-07-21 Thread Holden Karau (Jira)
Holden Karau created SPARK-36250:


 Summary: Add support for running make-distribution without a 
"clean"
 Key: SPARK-36250
 URL: https://issues.apache.org/jira/browse/SPARK-36250
 Project: Spark
  Issue Type: Improvement
  Components: Build, Kubernetes
Affects Versions: 3.2.0, 3.3.0
Reporter: Holden Karau


Running the K8s integration tests requires building a distribution, but clean 
builds are really slow. We could make the BUILD_COMMAND param settable if 
unset, or add a --skip-clean flag to our shell script, to allow folks to more 
quickly test their K8s-related changes.






[jira] [Assigned] (SPARK-36248) Add rename_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36248:


Assignee: Apache Spark

> Add rename_categories to CategoricalAccessor and CategoricalIndex
> -
>
> Key: SPARK-36248
> URL: https://issues.apache.org/jira/browse/SPARK-36248
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Add rename_categories to CategoricalAccessor and CategoricalIndex to follow 
> pandas.






[jira] [Commented] (SPARK-36248) Add rename_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385161#comment-17385161
 ] 

Apache Spark commented on SPARK-36248:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33471

> Add rename_categories to CategoricalAccessor and CategoricalIndex
> -
>
> Key: SPARK-36248
> URL: https://issues.apache.org/jira/browse/SPARK-36248
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Add rename_categories to CategoricalAccessor and CategoricalIndex to follow 
> pandas.






[jira] [Assigned] (SPARK-36248) Add rename_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36248:


Assignee: (was: Apache Spark)

> Add rename_categories to CategoricalAccessor and CategoricalIndex
> -
>
> Key: SPARK-36248
> URL: https://issues.apache.org/jira/browse/SPARK-36248
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Add rename_categories to CategoricalAccessor and CategoricalIndex to follow 
> pandas.






[jira] [Resolved] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-33242.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33469
[https://github.com/apache/spark/pull/33469]

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 3.3.0
>
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.






[jira] [Resolved] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-32391.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33469
[https://github.com/apache/spark/pull/33469]

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 3.3.0
>
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> https://pypi.org/project/pydata-sphinx-theme/ is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.






[jira] [Resolved] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-32666.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33469
[https://github.com/apache/spark/pull/33469]

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 3.3.0
>
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Resolved] (SPARK-32797) Install mypy on the Jenkins CI workers

2021-07-21 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-32797.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33469
[https://github.com/apache/spark/pull/33469]

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 3.3.0
>
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!






[jira] [Commented] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385132#comment-17385132
 ] 

Apache Spark commented on SPARK-32666:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Commented] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385133#comment-17385133
 ] 

Apache Spark commented on SPARK-32666:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Assigned] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32666:


Assignee: Shane Knapp  (was: Apache Spark)

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Commented] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385131#comment-17385131
 ] 

Apache Spark commented on SPARK-32666:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Commented] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385130#comment-17385130
 ] 

Apache Spark commented on SPARK-33242:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.






[jira] [Assigned] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32666:


Assignee: Apache Spark  (was: Shane Knapp)

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.






[jira] [Assigned] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32666:


Assignee: Shane Knapp  (was: Apache Spark)

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385129#comment-17385129
 ] 

Apache Spark commented on SPARK-33242:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33242:


Assignee: Shane Knapp  (was: Apache Spark)

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385128#comment-17385128
 ] 

Apache Spark commented on SPARK-33242:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33242:


Assignee: Apache Spark  (was: Shane Knapp)

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385127#comment-17385127
 ] 

Apache Spark commented on SPARK-32391:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32391:


Assignee: Apache Spark  (was: Shane Knapp)

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32391:


Assignee: Shane Knapp  (was: Apache Spark)

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32797) Install mypy on the Jenkins CI workers

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385124#comment-17385124
 ] 

Apache Spark commented on SPARK-32797:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Shane Knapp
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385125#comment-17385125
 ] 

Apache Spark commented on SPARK-32391:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385126#comment-17385126
 ] 

Apache Spark commented on SPARK-32391:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32797) Install mypy on the Jenkins CI workers

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32797:


Assignee: Apache Spark  (was: Shane Knapp)

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Apache Spark
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32797) Install mypy on the Jenkins CI workers

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32797:


Assignee: Shane Knapp  (was: Apache Spark)

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Shane Knapp
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32797) Install mypy on the Jenkins CI workers

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385123#comment-17385123
 ] 

Apache Spark commented on SPARK-32797:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/33469

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Shane Knapp
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2021-07-21 Thread Ashish Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385071#comment-17385071
 ] 

Ashish Singh commented on SPARK-31162:
--

This is needed for reasons other than supporting Hive bucketed writes. For 
example, it is also needed to make sure custom partitioners from Hive (using a 
Hive UDF) can partition the same way Hive does.

Assigning it to myself, but let me know if you are working on this already 
[~maropu].

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of 
> Spark's default Murmur hash when performing Spark's BucketBy operation. 
> According to the discussion with [~maropu] and [~hyukjin.kwon], it was 
> suggested to open a new JIRA. 
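
A hedged sketch of where the hash function matters when writing a bucketed 
table (session setup and table name are illustrative; the configuration flag 
this ticket asks for does not exist yet):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.range(100).withColumnRenamed("id", "key")

# Spark assigns rows to bucket files with its default Murmur3 hash of
# "key"; Hive-written bucketed tables use Hive Hash, so the resulting
# layouts are not interchangeable between the two engines.
(df.write
   .bucketBy(8, "key")
   .sortBy("key")
   .saveAsTable("bucketed_demo"))
{code}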



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36249) Add remove_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385069#comment-17385069
 ] 

Takuya Ueshin commented on SPARK-36249:
---

I'm working on this.

> Add remove_categories to CategoricalAccessor and CategoricalIndex
> -
>
> Key: SPARK-36249
> URL: https://issues.apache.org/jira/browse/SPARK-36249
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36249) Add remove_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36249:
-

 Summary: Add remove_categories to CategoricalAccessor and 
CategoricalIndex
 Key: SPARK-36249
 URL: https://issues.apache.org/jira/browse/SPARK-36249
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin
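
For reference, a sketch of the pandas behavior to be matched (the 
pandas-on-Spark counterpart is the subject of this ticket and not assumed 
here):

{code:python}
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# remove_categories drops the listed categories; affected values become NaN.
print(s.cat.remove_categories(["b"]).tolist())
# ['a', nan, 'a']
{code}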






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385067#comment-17385067
 ] 

Apache Spark commented on SPARK-36214:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33470

> Add add_categories to CategoricalAccessor and CategoricalIndex.
> ---
>
> Key: SPARK-36214
> URL: https://issues.apache.org/jira/browse/SPARK-36214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36214:


Assignee: (was: Apache Spark)

> Add add_categories to CategoricalAccessor and CategoricalIndex.
> ---
>
> Key: SPARK-36214
> URL: https://issues.apache.org/jira/browse/SPARK-36214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36214:


Assignee: Apache Spark

> Add add_categories to CategoricalAccessor and CategoricalIndex.
> ---
>
> Key: SPARK-36214
> URL: https://issues.apache.org/jira/browse/SPARK-36214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35546) Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way

2021-07-21 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-35546:
---

Assignee: Ye Zhou

> Enable push-based shuffle when multiple app attempts are enabled and manage 
> concurrent access to the state in a better way 
> ---
>
> Key: SPARK-35546
> URL: https://issues.apache.org/jira/browse/SPARK-35546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Ye Zhou
>Assignee: Ye Zhou
>Priority: Major
> Fix For: 3.2.0
>
>
> In the current implementation of RemoteBlockPushResolver, two 
> ConcurrentHashMaps are used to store (1) applicationId -> 
> mergedShuffleLocalDirPath and (2) applicationId+attemptId+shuffleID -> 
> mergedShufflePartitionInfo. As the four types of messages 
> (ExecutorRegister, PushBlocks, FinalizeShuffleMerge and ApplicationRemove) 
> trigger different types of operations within these two hashmaps, strong 
> consistency must be maintained for the information stored in them. 
> Otherwise, there will be either data corruption/correctness issues or a 
> memory leak in the shuffle server. 
> We should come up with a systematic way to resolve this, other than spot 
> fixing the potential issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36248) Add rename_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385063#comment-17385063
 ] 

Xinrong Meng commented on SPARK-36248:
--

I'm working on this.

> Add rename_categories to CategoricalAccessor and CategoricalIndex
> -
>
> Key: SPARK-36248
> URL: https://issues.apache.org/jira/browse/SPARK-36248
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Add rename_categories to CategoricalAccessor and CategoricalIndex to follow 
> pandas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36248) Add rename_categories to CategoricalAccessor and CategoricalIndex

2021-07-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36248:


 Summary: Add rename_categories to CategoricalAccessor and 
CategoricalIndex
 Key: SPARK-36248
 URL: https://issues.apache.org/jira/browse/SPARK-36248
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


Add rename_categories to CategoricalAccessor and CategoricalIndex to follow 
pandas.
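
For reference, a sketch of the pandas behavior being ported (the 
pandas-on-Spark counterpart would hang off Series.cat and CategoricalIndex):

{code:python}
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# rename_categories relabels the categories without touching the codes.
print(s.cat.rename_categories({"a": "A", "b": "B"}).tolist())
# ['A', 'B', 'A']
{code}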



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36188) Add categories setter to CategoricalAccessor and CategoricalIndex.

2021-07-21 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36188.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 33448
https://github.com/apache/spark/pull/33448

> Add categories setter to CategoricalAccessor and CategoricalIndex.
> --
>
> Key: SPARK-36188
> URL: https://issues.apache.org/jira/browse/SPARK-36188
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36247) check string length for char/varchar in UPDATE/MERGE command

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36247:


Assignee: (was: Apache Spark)

> check string length for char/varchar in UPDATE/MERGE command
> 
>
> Key: SPARK-36247
> URL: https://issues.apache.org/jira/browse/SPARK-36247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36247) check string length for char/varchar in UPDATE/MERGE command

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36247:


Assignee: Apache Spark

> check string length for char/varchar in UPDATE/MERGE command
> 
>
> Key: SPARK-36247
> URL: https://issues.apache.org/jira/browse/SPARK-36247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36247) check string length for char/varchar in UPDATE/MERGE command

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385045#comment-17385045
 ] 

Apache Spark commented on SPARK-36247:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33468

> check string length for char/varchar in UPDATE/MERGE command
> 
>
> Key: SPARK-36247
> URL: https://issues.apache.org/jira/browse/SPARK-36247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36142) Adjust exponentiation between Series with missing values and bool literal to follow pandas

2021-07-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36142:
-
Summary: Adjust exponentiation between Series with missing values and bool 
literal to follow pandas  (was: Adjust exponentiation between ExtentionDtypes 
and bools to follow pandas)

> Adjust exponentiation between Series with missing values and bool literal to 
> follow pandas
> --
>
> Key: SPARK-36142
> URL: https://issues.apache.org/jira/browse/SPARK-36142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, exponentiation between ExtensionDtypes and bools is not 
> consistent with pandas' behavior.
>  
> {code:python}
>  >>> pser = pd.Series([1, 2, np.nan], dtype=float)
>  >>> psser = ps.from_pandas(pser)
>  >>> pser ** False
>  0 1.0
>  1 1.0
>  2 1.0
>  dtype: float64
>  >>> psser ** False
>  0 1.0
>  1 1.0
>  2 NaN
>  dtype: float64
> {code}
> We ought to adjust that.
>  
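
For reference, pandas follows the IEEE rule that anything raised to the zeroth 
power is 1.0, missing or not; a quick check:

{code:python}
import numpy as np

# NaN ** 0 evaluates to 1.0, which is why pandas fills the whole result
# column with 1.0 in the example above, while pandas-on-Spark keeps NaN.
print(np.nan ** False)  # 1.0
print(np.nan ** 0)      # 1.0
{code}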



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385044#comment-17385044
 ] 

Shane Knapp commented on SPARK-32391:
-

anyways, i installed this via conda and will roll out to all workers later this 
week.  :)

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36247) check string length for char/varchar in UPDATE/MERGE command

2021-07-21 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36247:
---

 Summary: check string length for char/varchar in UPDATE/MERGE 
command
 Key: SPARK-36247
 URL: https://issues.apache.org/jira/browse/SPARK-36247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan
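
A hedged sketch of what the check covers (the table and column are 
hypothetical, and UPDATE/MERGE require a v2 data source that supports them):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With a column c declared VARCHAR(3), INSERT already rejects over-length
# strings; the same length check should apply on the UPDATE/MERGE path.
spark.sql("UPDATE t SET c = 'abcd' WHERE id = 1")
{code}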






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34930) Install PyArrow and pandas on Jenkins

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385040#comment-17385040
 ] 

Shane Knapp commented on SPARK-34930:
-

oh yeah, a LOT of those skipped tests are for pypy3, not python3.6

> Install PyArrow and pandas on Jenkins
> -
>
> Key: SPARK-34930
> URL: https://issues.apache.org/jira/browse/SPARK-34930
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Critical
>
> Looks like Jenkins machines don't have pandas and PyArrow (ever since they 
> got upgraded?), which results in skipping the related tests in PySpark; see 
> also https://github.com/apache/spark/pull/31470#issuecomment-811618571
> It would be great if we could install both in Python 3.6 on Jenkins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32797) Install mypy on the Jenkins CI workers

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385038#comment-17385038
 ] 

Shane Knapp commented on SPARK-32797:
-

ill roll this out (and other python package updates) later today/this week.

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Shane Knapp
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36246) WorkerDecommissionExtendedSuite flakes with GHA

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36246:


Assignee: Holden Karau  (was: Apache Spark)

> WorkerDecommissionExtendedSuite flakes with GHA
> ---
>
> Key: SPARK-36246
> URL: https://issues.apache.org/jira/browse/SPARK-36246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36246) WorkerDecommissionExtendedSuite flakes with GHA

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385035#comment-17385035
 ] 

Apache Spark commented on SPARK-36246:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33467

> WorkerDecommissionExtendedSuite flakes with GHA
> ---
>
> Key: SPARK-36246
> URL: https://issues.apache.org/jira/browse/SPARK-36246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36246) WorkerDecommissionExtendedSuite flakes with GHA

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36246:


Assignee: Apache Spark  (was: Holden Karau)

> WorkerDecommissionExtendedSuite flakes with GHA
> ---
>
> Key: SPARK-36246
> URL: https://issues.apache.org/jira/browse/SPARK-36246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.3.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6

2021-07-21 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-29183.
-
Resolution: Fixed

this is done and all java11 installs are at 11.0.10

> Upgrade JDK 11 Installation to 11.0.6
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression 
> fixes. We had better upgrade to the latest 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34930) Install PyArrow and pandas on Jenkins

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385030#comment-17385030
 ] 

Shane Knapp commented on SPARK-34930:
-

pandas is installed, so i'm a little curious as to why the tests aren't running:
{noformat}
jenkins@research-jenkins-worker-01:~$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'0.24.2'
>>>{noformat}
pyarrow is a much more complex install than just adding the package, and 
requires manual compilation.  i'll revisit pyarrow in the next couple of 
weeks...
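
Once both packages land, a quick sanity check on a worker could look like this 
(treat it as illustrative; the exact minimum versions live in pyspark's 
setup.py):

{code:python}
import pandas
import pyarrow

# Both imports succeeding in the Jenkins Python 3.6 env is what stops
# the related PySpark tests from being skipped.
print(pandas.__version__, pyarrow.__version__)
{code}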

> Install PyArrow and pandas on Jenkins
> -
>
> Key: SPARK-34930
> URL: https://issues.apache.org/jira/browse/SPARK-34930
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Critical
>
> Looks like Jenkins machines don't have pandas and PyArrow (ever since they 
> got upgraded?), which results in skipping the related tests in PySpark; see 
> also https://github.com/apache/spark/pull/31470#issuecomment-811618571
> It would be great if we could install both in Python 3.6 on Jenkins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32391) Install pydata_sphinx_theme in Jenkins machines

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385026#comment-17385026
 ] 

Shane Knapp commented on SPARK-32391:
-

[~hyukjin.kwon] i am able to install this via conda...  any particular reason 
why you're requesting this through pip?

 

pydata-sphinx-theme-0.6.3 | pyhd8ed1ab_0 1.3 MB conda-forge

> Install pydata_sphinx_theme in Jenkins machines
> ---
>
> Key: SPARK-32391
> URL: https://issues.apache.org/jira/browse/SPARK-32391
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.0.1
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> After SPARK-32179, {{pydata_sphinx_theme}} 
> (https://pypi.org/project/pydata-sphinx-theme/) is needed as a new Python 
> dependency for the PySpark documentation build.
> We should install it in Jenkins to test the PySpark documentation build in 
> Python 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385021#comment-17385021
 ] 

Shane Knapp commented on SPARK-32666:
-

ill roll this out (and other python package updates) later today/this week.

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33242) Install numpydoc in Jenkins machines

2021-07-21 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385022#comment-17385022
 ] 

Shane Knapp commented on SPARK-33242:
-

ill roll this out (and other python package updates) later today/this week.

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36246) WorkerDecommissionExtendedSuite flakes with GHA

2021-07-21 Thread Holden Karau (Jira)
Holden Karau created SPARK-36246:


 Summary: WorkerDecommissionExtendedSuite flakes with GHA
 Key: SPARK-36246
 URL: https://issues.apache.org/jira/browse/SPARK-36246
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 3.3.0
Reporter: Holden Karau
Assignee: Holden Karau






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36143) Adjust astype of Series with missing values to follow pandas

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36143:


Assignee: (was: Apache Spark)

> Adjust astype of Series with missing values to follow pandas
> 
>
> Key: SPARK-36143
> URL: https://issues.apache.org/jira/browse/SPARK-36143
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:python}
> >>> pser = pd.Series([1, 2, np.nan], dtype=float)
> >>> psser = ps.from_pandas(pser)
> >>> pser.astype(int)
> ...
>  ValueError: Cannot convert non-finite values (NA or inf) to integer
> >>> psser.astype(int)
>  0 1.0
>  1 2.0
>  2 NaN
>  dtype: float64
> {code}
> As shown above, astype of a Series with missing values doesn't behave the 
> same as in pandas; we ought to adjust that.
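
A minimal sketch of the check pandas performs and which the port would need to 
mirror (the helper name is hypothetical):

{code:python}
import pandas as pd

def astype_like_pandas(values: pd.Series, dtype) -> pd.Series:
    # pandas refuses to cast non-finite values to an integer dtype;
    # mirroring that means raising instead of silently keeping NaN.
    if pd.api.types.is_integer_dtype(dtype) and values.isna().any():
        raise ValueError(
            "Cannot convert non-finite values (NA or inf) to integer")
    return values.astype(dtype)
{code}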



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36143) Adjust astype of Series with missing values to follow pandas

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36143:


Assignee: Apache Spark

> Adjust astype of Series with missing values to follow pandas
> 
>
> Key: SPARK-36143
> URL: https://issues.apache.org/jira/browse/SPARK-36143
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> {code:python}
> >>> pser = pd.Series([1, 2, np.nan], dtype=float)
> >>> psser = ps.from_pandas(pser)
> >>> pser.astype(int)
> ...
>  ValueError: Cannot convert non-finite values (NA or inf) to integer
> >>> psser.astype(int)
>  0 1.0
>  1 2.0
>  2 NaN
>  dtype: float64
> {code}
> As shown above, astype of a Series with missing values doesn't behave the 
> same as in pandas; we ought to adjust that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36143) Adjust astype of Series with missing values to follow pandas

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384999#comment-17384999
 ] 

Apache Spark commented on SPARK-36143:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33466

> Adjust astype of Series with missing values to follow pandas
> 
>
> Key: SPARK-36143
> URL: https://issues.apache.org/jira/browse/SPARK-36143
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:python}
> >>> pser = pd.Series([1, 2, np.nan], dtype=float)
> >>> psser = ps.from_pandas(pser)
> >>> pser.astype(int)
> ...
>  ValueError: Cannot convert non-finite values (NA or inf) to integer
> >>> psser.astype(int)
>  0 1.0
>  1 2.0
>  2 NaN
>  dtype: float64
> {code}
> As shown above, astype of a Series with missing values doesn't behave the 
> same as in pandas; we ought to adjust that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-36213.
--
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33424
[https://github.com/apache/spark/pull/33424]

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
> Attachments: image-2021-07-20-16-26-09-456.png
>
>
> !image-2021-07-20-16-26-09-456.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-36213:


Assignee: Kent Yao

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Attachments: image-2021-07-20-16-26-09-456.png
>
>
> !image-2021-07-20-16-26-09-456.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36227) Remove TimestampNTZ type support in Spark 3.2

2021-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36227.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33444
[https://github.com/apache/spark/pull/33444]

> Remove TimestampNTZ type support in Spark 3.2
> -
>
> Key: SPARK-36227
> URL: https://issues.apache.org/jira/browse/SPARK-36227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> As of now, there are some blockers for delivering the TimestampNTZ project in 
> Spark 3.2:
> # In the Hive Thrift server, both TimestampType and TimestampNTZType are 
> mapped to the same timestamp type, which can cause confusion for users. 
> # For the Parquet data source, newly written TimestampNTZType Parquet 
> columns will be read as TimestampType in old Spark releases. Also, we need to 
> decide the merged schema for files that mix TimestampType and TimestampNTZType.
> # The type coercion rules for TimestampNTZType are incomplete. For example, 
> what should the data type of the IN clause IN(Timestamp'2020-01-01 
> 00:00:00', TimestampNTZ'2020-01-01 00:00:00') be?
> # It is tricky to support TimestampNTZType in the JSON/CSV data readers. We 
> need to avoid regressions as much as we can.
> There are 10 days left before the expected 3.2 RC date, so I propose to 
> release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2, so that we 
> have enough time to design carefully for these issues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36143) Adjust astype of Series with missing values to follow pandas

2021-07-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36143:
-
Description: 
{code:python}
>>> pser = pd.Series([1, 2, np.nan], dtype=float)
>>> psser = ps.from_pandas(pser)
>>> pser.astype(int)
...
 ValueError: Cannot convert non-finite values (NA or inf) to integer
>>> psser.astype(int)
 0 1.0
 1 2.0
 2 NaN
 dtype: float64
{code}
As shown above, astype of a Series with missing values doesn't behave the same 
as in pandas; we ought to adjust that.

  was:
{code:java}
>>> pser = pd.Series([1, 2, np.nan], dtype=float)
>>> psser = ps.from_pandas(pser)
>>> pser.astype(int)
...
 ValueError: Cannot convert non-finite values (NA or inf) to integer
>>> psser.astype(int)
 0 1.0
 1 2.0
 2 NaN
 dtype: float64
{code}
As shown above, astype of Series of ExtensionDtype doesn't behave the same as 
pandas for ExtensionDtype Series, we ought to adjust that.


> Adjust astype of Series with missing values to follow pandas
> 
>
> Key: SPARK-36143
> URL: https://issues.apache.org/jira/browse/SPARK-36143
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {code:python}
> >>> pser = pd.Series([1, 2, np.nan], dtype=float)
> >>> psser = ps.from_pandas(pser)
> >>> pser.astype(int)
> ...
>  ValueError: Cannot convert non-finite values (NA or inf) to integer
> >>> psser.astype(int)
>  0 1.0
>  1 2.0
>  2 NaN
>  dtype: float64
> {code}
> As shown above, astype of a Series with missing values doesn't behave the 
> same as in pandas; we ought to adjust that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33865) When HiveDDL, we need check avro schema too like parquet & orc

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384951#comment-17384951
 ] 

Apache Spark commented on SPARK-33865:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33441

> When HiveDDL, we need check avro schema too like parquet & orc
> --
>
> Key: SPARK-33865
> URL: https://issues.apache.org/jira/browse/SPARK-33865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: org.apache.avro.SchemaParseException: Illegal initial character: 
> (IF((1 = 1), 1, 0))
>   at org.apache.avro.Schema.validateName(Schema.java:1147)
>   at org.apache.avro.Schema.access$200(Schema.java:81)
>   at org.apache.avro.Schema$Field.(Schema.java:403)
>   at org.apache.avro.Schema$Field.(Schema.java:396)
>   at 
> org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.createAvroField(TypeInfoToSchema.java:76)
>   at 
> org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.convert(TypeInfoToSchema.java:61)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.getSchemaFromCols(AvroSerDe.java:170)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:114)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
>   at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> {code}
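
A reproduction sketch consistent with the stack trace (Hive support assumed, 
table name illustrative): the unaliased expression becomes the Avro field 
name, which Avro's name validation rejects only at write time, while the 
Parquet and ORC paths catch it earlier.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The unaliased IF(...) yields the column name "(IF((1 = 1), 1, 0))",
# which org.apache.avro.Schema.validateName rejects during the write.
spark.sql("CREATE TABLE avro_demo STORED AS AVRO AS SELECT IF(1 = 1, 1, 0)")
{code}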



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36243) pyspark catalog.tableExists doesn't work for temporary views

2021-07-21 Thread Dominik Gehl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Gehl updated SPARK-36243:
-
Component/s: (was: Java API)
 PySpark
Description: 
Documentation in Catalog.scala for tableExists specifies

   * Check if the table or view with the specified name exists. This can either 
be a temporary
   * view or a table/view.

The pyspark version doesn't work correctly for temporary views

  was:
Documentation in Catalog.scala for tableExists specifies

   * Check if the table or view with the specified name exists. This can either 
be a temporary
   * view or a table/view.

temporary views don't seem to work

Summary: pyspark catalog.tableExists doesn't work for temporary views  
(was: scala catalog.tableExists doesn't work for temporary views)

> pyspark catalog.tableExists doesn't work for temporary views
> 
>
> Key: SPARK-36243
> URL: https://issues.apache.org/jira/browse/SPARK-36243
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Priority: Major
>
> Documentation in Catalog.scala for tableExists specifies
>* Check if the table or view with the specified name exists. This can 
> either be a temporary
>* view or a table/view.
> The pyspark version doesn't work correctly for temporary views
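
A quick reproduction sketch, assuming the PySpark Catalog exposes a 
tableExists mirroring the Scala API:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1).createOrReplaceTempView("my_temp_view")

# Per the Catalog.scala contract this should be True for a temporary
# view as well; the pyspark version reportedly returns False here.
print(spark.catalog.tableExists("my_temp_view"))
{code}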



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36245) Deduplicate the right side of left semi/anti join

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36245:


Assignee: Apache Spark

> Deduplicate the right side of left semi/anti join
> -
>
> Key: SPARK-36245
> URL: https://issues.apache.org/jira/browse/SPARK-36245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Deduplicate the right side of left semi/anti join to improve query 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36245) Deduplicate the right side of left semi/anti join

2021-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36245:


Assignee: (was: Apache Spark)

> Deduplicate the right side of left semi/anti join
> -
>
> Key: SPARK-36245
> URL: https://issues.apache.org/jira/browse/SPARK-36245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Deduplicate the right side of left semi/anti join to improve query 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36245) Deduplicate the right side of left semi/anti join

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384939#comment-17384939
 ] 

Apache Spark commented on SPARK-36245:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/33465

> Deduplicate the right side of left semi/anti join
> -
>
> Key: SPARK-36245
> URL: https://issues.apache.org/jira/browse/SPARK-36245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Deduplicate the right side of left semi/anti join to improve query 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-07-21 Thread Eric Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384934#comment-17384934
 ] 

Eric Richardson commented on SPARK-25075:
-

Great news that you will have 2.12 and 2.13 artifacts.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36245) Deduplicate the right side of left semi/anti join

2021-07-21 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36245:
---

 Summary: Deduplicate the right side of left semi/anti join
 Key: SPARK-36245
 URL: https://issues.apache.org/jira/browse/SPARK-36245
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang


Deduplicate the right side of left semi/anti join to improve query performance.
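The intuition, as a hedged sketch (illustrative data; the proposed change is an optimizer rule, not a user-side rewrite): a left semi or anti join only asks whether a match exists on the right side, so dropping duplicate right-side rows cannot change the result while shrinking the side that gets shuffled or built into a hash table.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
# Duplicates on the right side of a semi/anti join are redundant:
# the join only checks for the existence of a match.
right = spark.createDataFrame([(1,), (1,), (2,), (2,)], ["id"])

semi = left.join(right, "id", "left_semi")
semi_dedup = left.join(right.distinct(), "id", "left_semi")

# Same rows either way; the deduplicated plan can be cheaper to execute.
assert sorted(r.id for r in semi.collect()) == \
       sorted(r.id for r in semi_dedup.collect())
{code}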



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28266) data duplication when `path` serde property is present

2021-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28266:
---

Assignee: Shardul Mahadik

> data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2
>Reporter: Ruslan Dautkhanov
>Assignee: Shardul Mahadik
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Spark duplicates returned datasets when the `path` serde property is present
> in a Parquet table.
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (All is good at this point. Now exit the session and run the following in
> Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and the serde `path` property point to the same location.
> Now the count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of the `path` serde property makes the table
> LOCATION show up twice:
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create Parquet tables in Hive with the `path`
> serde property, and this duplicates data in query results.
> Hive, Impala, etc., and Spark 2.1 and earlier read such tables fine, but
> Spark 2.2 and later releases do not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28266) data duplication when `path` serde property is present

2021-07-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28266.
-
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33328
[https://github.com/apache/spark/pull/33328]

> data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> Spark duplicates returned datasets when the `path` serde property is present
> in a Parquet table.
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (All is good at this point. Now exit the session and run the following in
> Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and the serde `path` property point to the same location.
> Now the count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of the `path` serde property makes the table
> LOCATION show up twice:
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create Parquet tables in Hive with the `path`
> serde property, and this duplicates data in query results.
> Hive, Impala, etc., and Spark 2.1 and earlier read such tables fine, but
> Spark 2.2 and later releases do not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36244) Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation

2021-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384923#comment-17384923
 ] 

Apache Spark commented on SPARK-36244:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33464

> Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation
> 
>
> Key: SPARK-36244
> URL: https://issues.apache.org/jira/browse/SPARK-36244
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> zstd-jni 1.5.0-3 was released a few days ago.
> This release resolves an issue with buffer size calculation, which can
> affect usage in Spark.
> https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3
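For context, a hedged sketch of where the upgraded library is exercised: Spark routes its internal stream compression through zstd-jni when the codec is set to zstd via the standard configuration key below (the session itself is illustrative).

{code:python}
from pyspark.sql import SparkSession

# "zstd" sends Spark's internal I/O (shuffle spill, broadcast, etc.)
# through zstd-jni, which is where the buffer-size calculation fix in
# 1.5.0-3 becomes relevant.
spark = (
    SparkSession.builder
    .config("spark.io.compression.codec", "zstd")
    .getOrCreate()
)
{code}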



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


