[jira] [Commented] (SPARK-32677) Cache function directly after create

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181651#comment-17181651
 ] 

Apache Spark commented on SPARK-32677:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29502

> Cache function directly after create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function, and add the function to the function registry.
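To make the intended change concrete, here is a minimal, self-contained Scala sketch with hypothetical stand-ins for the catalog and the function registry (these are not Spark's actual classes or APIs):
{code:java}
// Hypothetical sketch: verify the UDF class is loadable before creating the
// function, then put it straight into an in-memory registry (the "cache").
object CreateFunctionSketch {
  final case class FunctionInfo(name: String, className: String)

  // Stand-ins for the session catalog and the function registry.
  private val catalog  = scala.collection.mutable.Map.empty[String, FunctionInfo]
  private val registry = scala.collection.mutable.Map.empty[String, Class[_]]

  def createFunction(name: String, className: String): Unit = {
    // Class check first: fail fast if the implementation class cannot be loaded.
    val clazz =
      try Class.forName(className)
      catch {
        case e: ClassNotFoundException =>
          throw new IllegalArgumentException(
            s"Cannot load class '$className' for function '$name'", e)
      }
    catalog(name)  = FunctionInfo(name, className) // create the function metadata
    registry(name) = clazz                         // cache it directly after creation
  }

  def main(args: Array[String]): Unit = {
    createFunction("my_func", "java.lang.String") // any loadable class works for the demo
    println(registry.keySet)                      // Set(my_func)
  }
}
{code}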






[jira] [Assigned] (SPARK-32677) Cache function directly after create

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32677:


Assignee: Apache Spark

> Cache function directly after create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function, and add the function to the function registry.






[jira] [Commented] (SPARK-32677) Cache function directly after create

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181650#comment-17181650
 ] 

Apache Spark commented on SPARK-32677:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29502

> Cache function directly after create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function, and add the function to the function registry.






[jira] [Assigned] (SPARK-32677) Cache function directly after create

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32677:


Assignee: (was: Apache Spark)

> Cache function directly after create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function, and add the function to the function registry.






[jira] [Updated] (SPARK-32677) Cache function directly after create

2020-08-20 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-32677:

Description: Change the `CreateFunctionCommand` code to add a class check before
creating the function, and add the function to the function registry.  (was: Change
the `CreateFunctionCommand` code to add a class check before creating the function.)

> Cache function directly after create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function, and add the function to the function registry.






[jira] [Updated] (SPARK-32677) Cache function after create

2020-08-20 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-32677:

Summary: Cache function after create  (was: Cache function after create 
function)

> Cache function after create
> ---
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function.






[jira] [Updated] (SPARK-32677) Cache function after create function

2020-08-20 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-32677:

Summary: Cache function after create function  (was: Add class check before 
create function)

> Cache function after create function
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function.






[jira] [Commented] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181642#comment-17181642
 ] 

Dongjoon Hyun commented on SPARK-32517:
---

Lastly, thank you for your persistent interest in ARM support for Apache Spark,
[~huangtianhua]. It's very helpful for the Apache Spark community to understand the
issue. Although I don't have an ARM machine for testing, please ping me on the new JIRA.

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Updated] (SPARK-32677) Cache function directly after create

2020-08-20 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-32677:

Summary: Cache function directly after create  (was: Cache function after 
create)

> Cache function directly after create
> 
>
> Key: SPARK-32677
> URL: https://issues.apache.org/jira/browse/SPARK-32677
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Change the `CreateFunctionCommand` code to add a class check before creating
> the function.






[jira] [Created] (SPARK-32677) Add class check before create function

2020-08-20 Thread ulysses you (Jira)
ulysses you created SPARK-32677:
---

 Summary: Add class check before create function
 Key: SPARK-32677
 URL: https://issues.apache.org/jira/browse/SPARK-32677
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you


Change the `CreateFunctionCommand` code to add a class check before creating the function.






[jira] [Commented] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181641#comment-17181641
 ] 

Dongjoon Hyun commented on SPARK-32517:
---

BTW, this is not a bug in the new code, because `StorageLevel.DISK_ONLY_3`
succeeds on ARM CPUs. In other words, this will be an existing ARM CPU bug in the
`with replication as stream` code path that is exposed by the change from
`local-cluster[2,1,1024]` to `local-cluster[3,1,1024]`.
{code}
- caching on disk, replicated 3 (encryption = off)
- caching on disk, replicated 3 (encryption = off) (with replication as stream)
- caching on disk, replicated 3 (encryption = on)
- caching on disk, replicated 3 (encryption = on) (with replication as stream)
{code}

In this case, I'd recommend filing a new JIRA for that.

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Commented] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181637#comment-17181637
 ] 

Dongjoon Hyun commented on SPARK-32517:
---

So, the failure situation looks like this. For the following cases,
{code}
"caching in memory, serialized, replicated" -> StorageLevel.MEMORY_ONLY_SER_2,
{code}

The number of replicas should be 2, and it is on Intel CPUs. But in the ARM
testing environment, it became 3.
{code}
- caching in memory, serialized, replicated (encryption = off)
- caching in memory, serialized, replicated (encryption = off) (with 
replication as stream)
- caching in memory, serialized, replicated (encryption = on)
- caching in memory, serialized, replicated (encryption = on) (with replication 
as stream) *** FAILED ***
  3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
{code}

{code}
- caching in memory and disk, serialized, replicated (encryption = off)
- caching in memory and disk, serialized, replicated (encryption = off) (with 
replication as stream)
- caching in memory and disk, serialized, replicated (encryption = on)
- caching in memory and disk, serialized, replicated (encryption = on) (with 
replication as stream) *** FAILED ***
  3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
{code}

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Updated] (SPARK-32669) expression unit tests should explore all cases that can lead to null result

2020-08-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32669:

Description: Add documentation to {{ExpressionEvalHelper}}, and ask people to
explore all the cases that can lead to null results (including nulls in struct
fields, array elements and map values).

> expression unit tests should explore all cases that can lead to null result
> ---
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Add documentation to {{ExpressionEvalHelper}}, and ask people to explore all the
> cases that can lead to null results (including nulls in struct fields, array
> elements and map values).






[jira] [Updated] (SPARK-32669) expression unit tests should explore all cases that can lead to null result

2020-08-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32669:

Summary: expression unit tests should explore all cases that can lead to 
null result  (was: test expression nullability when checking result)

> expression unit tests should explore all cases that can lead to null result
> ---
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181635#comment-17181635
 ] 

Dongjoon Hyun commented on SPARK-32517:
---

Thank you for letting me know, [~huangtianhua]. I'll take a look.

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Assigned] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32663:
---

Assignee: Attila Zsolt Piros

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client that will be re-used for other requests. There
> could be other outstanding requests to the shuffle service, so it should not
> be closed after processing a single response.
> This seems to be a bug introduced by SPARK-27651 and SPARK-27677.
> The older methods {{registerWithShuffleServer}} and {{fetchBlocks}} didn't
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]
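To make the failure mode concrete, here is a minimal, self-contained sketch with stub types (none of these are Spark's actual networking classes): a cached, shared client must not be closed from one request's callback, because other requests may still be outstanding on the same connection.
{code:java}
object SharedClientSketch {
  // Stub standing in for a pooled transport client.
  final class TransportClientStub {
    @volatile private var open = true
    def send(request: String)(onResponse: String => Unit): Unit = {
      require(open, s"client already closed; cannot send '$request'")
      onResponse(s"ok: $request")
    }
    def close(): Unit = open = false
  }

  def main(args: Array[String]): Unit = {
    val cached = new TransportClientStub // shared, cached connection

    // Buggy pattern described in the ticket: close the shared client in the callback.
    cached.send("removeBlocks") { _ => cached.close() }

    // A later caller re-using the cached client now fails, even though the
    // connection should have stayed open for other outstanding requests.
    try cached.send("getHostLocalDirs")(resp => println(resp))
    catch { case e: IllegalArgumentException => println(s"broken: ${e.getMessage}") }
  }
}
{code}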






[jira] [Resolved] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32663.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29492
[https://github.com/apache/spark/pull/29492]

> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client that will be re-used for other requests. There
> could be other outstanding requests to the shuffle service, so it should not
> be closed after processing a single response.
> This seems to be a bug introduced by SPARK-27651 and SPARK-27677.
> The older methods {{registerWithShuffleServer}} and {{fetchBlocks}} didn't
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181632#comment-17181632
 ] 

Jungtaek Lim commented on SPARK-32672:
--

Just FYI, he's a PMC member. And a correctness issue is normally a blocker
unless there's a strong reason not to address it right away.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data. If I disable compression in the cache, the issue goes
> away. I was able to make this happen in 3.0.0. I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}
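The description notes that the corruption disappears when cache compression is disabled; a quick spark-shell check along those lines (config name as of Spark 3.0, worth verifying for your version) might look like:
{code}
scala> spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")

scala> bad_order.cache()

scala> bad_order.groupBy("b").count.show   // counts should now match the uncached result
{code}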






[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32672:
-
Priority: Blocker  (was: Critical)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data. If I disable compression in the cache, the issue goes
> away. I was able to make this happen in 3.0.0. I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}






[jira] [Resolved] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-32660.

Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/29476

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Avro related APIs are missing in the documentation 
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
> This PR is to:
> 1. Mark internal Avro related classes as private
> 2. Show Avro related API in Spark official API documentation






[jira] [Updated] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-32660:
---
Description: 
Currently, the Avro related APIs are missing in the documentation 
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
This PR is to:
1. Mark internal Avro related classes as private
2. Show Avro related API in Spark official API documentation

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Avro related APIs are missing in the documentation 
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
> This PR is to:
> 1. Mark internal Avro related classes as private
> 2. Show Avro related API in Spark official API documentation






[jira] [Commented] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181617#comment-17181617
 ] 

Gengliang Wang commented on SPARK-32660:


[~rohitmishr1484] sure

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Avro related APIs are missing in the documentation 
> https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . 
> This PR is to:
> 1. Mark internal Avro related classes as private
> 2. Show Avro related API in Spark official API documentation






[jira] [Commented] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181561#comment-17181561
 ] 

Apache Spark commented on SPARK-32676:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29501

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}
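A minimal, self-contained sketch of why the {{getStorageLevel}} guard never sees the caller's cache (illustrative setup only, not the MLlib code itself): mapping an RDD produces a new RDD whose storage level is NONE even when the input is persisted, so the derived RDD gets persisted a second time.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DoubleCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("double-cache-sketch").getOrCreate()
    val data = spark.sparkContext.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)

    // Analogous to run(): map each point to (point, weight).
    val instances = data.map(x => (x, 1.0))

    println(data.getStorageLevel)      // the caller's level (memory and disk)
    println(instances.getStorageLevel) // StorageLevel.NONE, so the guard persists it again
    spark.stop()
  }
}
{code}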






[jira] [Assigned] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32676:


Assignee: Apache Spark

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Assigned] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32676:


Assignee: (was: Apache Spark)

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Commented] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181560#comment-17181560
 ] 

Apache Spark commented on SPARK-32676:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29501

> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Commented] (SPARK-29967) KMeans support instance weighting

2020-08-20 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181558#comment-17181558
 ] 

zhengruifeng commented on SPARK-29967:
--

[~YuQiang Ye] I opened ticket SPARK-32676 for this issue and sent a PR: 
https://github.com/apache/spark/pull/29501

> KMeans support instance weighting
> -
>
> Key: SPARK-29967
> URL: https://issues.apache.org/jira/browse/SPARK-29967
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Since https://issues.apache.org/jira/browse/SPARK-9610, we have started to support
> instance weighting in ML.
> However, clustering and some other implementations still do not support instance
> weighting.
> I think we need to start supporting weighting in KMeans, like scikit-learn does.
> It will contain three parts:
> 1. move the implementation from .mllib to .ml
> 2. make .mllib.KMeans a wrapper of .ml.KMeans
> 3. support instance weighting in .ml.KMeans
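For reference, a small, self-contained example of the feature this ticket delivered, instance weights in .ml KMeans (setter names as of Spark 3.0; worth double-checking against your version):
{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object WeightedKMeansExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("weighted-kmeans").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Vectors.dense(0.0, 0.0), 1.0),
      (Vectors.dense(0.1, 0.1), 1.0),
      (Vectors.dense(9.0, 9.0), 5.0)   // heavier point pulls its cluster centre
    ).toDF("features", "weight")

    val model = new KMeans().setK(2).setWeightCol("weight").setSeed(1L).fit(df)
    model.clusterCenters.foreach(println)
    spark.stop()
  }
}
{code}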






[jira] [Updated] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32676:
-
Description: 
On the .mllib side, the storageLevel of the input {{data}} is always ignored and
the data is cached twice:
{code:java}
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  runWithWeight(instances, None)
}
 {code}
{code:java}
private[spark] def runWithWeight(
data: RDD[(Vector, Double)],
instr: Option[Instrumentation]): KMeansModel = {

  // Compute squared norms and cache them.
  val norms = data.map { case (v, _) =>
Vectors.norm(v, 2.0)
  }

  val zippedData = data.zip(norms).map { case ((v, w), norm) =>
new VectorWithNorm(v, norm, w)
  }

  if (data.getStorageLevel == StorageLevel.NONE) {
zippedData.persist(StorageLevel.MEMORY_AND_DISK)
  }
  val model = runAlgorithmWithWeight(zippedData, instr)
  zippedData.unpersist()

  model
} {code}

  was:
On the .mllib side, the storageLevel of the input {{data}} is always ignored and
the data is cached twice:
{code:java}
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  runWithWeight(instances, None)
}{code}


> Fix double caching in KMeans/BiKMeans
> -
>
> Key: SPARK-32676
> URL: https://issues.apache.org/jira/browse/SPARK-32676
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> On the .mllib side, the storageLevel of the input {{data}} is always ignored
> and the data is cached twice:
> {code:java}
> @Since("0.8.0")
> def run(data: RDD[Vector]): KMeansModel = {
>   val instances = data.map(point => (point, 1.0))
>   runWithWeight(instances, None)
> }
>  {code}
> {code:java}
> private[spark] def runWithWeight(
> data: RDD[(Vector, Double)],
> instr: Option[Instrumentation]): KMeansModel = {
>   // Compute squared norms and cache them.
>   val norms = data.map { case (v, _) =>
> Vectors.norm(v, 2.0)
>   }
>   val zippedData = data.zip(norms).map { case ((v, w), norm) =>
> new VectorWithNorm(v, norm, w)
>   }
>   if (data.getStorageLevel == StorageLevel.NONE) {
> zippedData.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>   val model = runAlgorithmWithWeight(zippedData, instr)
>   zippedData.unpersist()
>   model
> } {code}






[jira] [Created] (SPARK-32676) Fix double caching in KMeans/BiKMeans

2020-08-20 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-32676:


 Summary: Fix double caching in KMeans/BiKMeans
 Key: SPARK-32676
 URL: https://issues.apache.org/jira/browse/SPARK-32676
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0, 3.1.0
Reporter: zhengruifeng


On the .mllib side, the storageLevel of the input {{data}} is always ignored and
the data is cached twice:
{code:java}
@Since("0.8.0")
def run(data: RDD[Vector]): KMeansModel = {
  val instances = data.map(point => (point, 1.0))
  runWithWeight(instances, None)
}{code}






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181553#comment-17181553
 ] 

Lantao Jin commented on SPARK-32672:


Changed to Critical; Blocker is reserved for committers.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data. If I disable compression in the cache, the issue goes
> away. I was able to make this happen in 3.0.0. I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}






[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32672:
---
Priority: Critical  (was: Blocker)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data. If I disable compression in the cache, the issue goes
> away. I was able to make this happen in 3.0.0. I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}






[jira] [Updated] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Farhan Khan updated SPARK-32675:

Description: 
Submitted an application to a Mesos cluster in cluster mode using the REST
Submission API; the --py-files option is passed in a hardcoded manner, which causes
a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}
Expected Driver log would contain:
{code:bash}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}

  was:
Submitted an application to a Mesos cluster in cluster mode using the REST
Submission API; the --py-files option is passed in a hardcoded manner, which causes
a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}

Expected Dispatcher output would contain:

{code:bash}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}


> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> Submitted an application to a Mesos cluster in cluster mode using the REST
> Submission API; the --py-files option is passed in a hardcoded manner, which
> causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Driver log would contain:
> {code:bash}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipp

[jira] [Commented] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Sandy Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181526#comment-17181526
 ] 

Sandy Su commented on SPARK-32673:
--

df_signals = df_record_names.repartition('record_name').select(
 df_record_names.record_id,
 extract_signals_udf(df_record_names.record_name).alias('signal_info'))

df_signals = df_signals.select(df_signals.record_id,
 df_signals.signal_info.patient_id.alias('patient_id'),
 df_signals.signal_info.comments.alias('comments'),
 df_signals.signal_info.signals.alias('signals'))

display(df_signals.drop('signals'))

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> (1) Spark Jobs
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 10.139.64.5,
> executor 0): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/worker.py", line 644, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "/databricks/spark/python/pyspark/worker.py", line 463, in read_udfs
>     udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
>   File "/databricks/spark/python/pyspark/worker.py", line 254, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)
>   File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command
>     command = serializer._read_with_length(file)
>   File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
>     raise SerializationError("Caused by " + traceback.format_exc())
> pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'
> {code}






[jira] [Commented] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181524#comment-17181524
 ] 

Apache Spark commented on SPARK-32667:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29500

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Assigned] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32667:


Assignee: (was: Apache Spark)

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Commented] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181523#comment-17181523
 ] 

Apache Spark commented on SPARK-32667:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29500

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Assigned] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32667:


Assignee: Apache Spark

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}






[jira] [Updated] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-20 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32667:
--
Description: 
Script transform no-serde mode should pad null values to fill the missing columns
{code:java}
hive> SELECT TRANSFORM(a, b)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '|'
>   LINES TERMINATED BY '\n'
>   NULL DEFINED AS 'NULL'
> USING 'cat' as (a string, b string, c string, d string)
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '|'
>   LINES TERMINATED BY '\n'
>   NULL DEFINED AS 'NULL'
> FROM (
> select 1 as a, 2 as b
> ) tmp ;
OK
1    2    NULL    NULL
Time taken: 24.626 seconds, Fetched: 1 row(s)

{code}

> Script transformation no-serde mode: when columns are fewer than output length, 
> use null to fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Script transform no-serde mode should pad null values to fill the missing columns
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1    2    NULL    NULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32632.
--
Resolution: Not A Problem

> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> --
>
> Key: SPARK-32632
> URL: https://issues.apache.org/jira/browse/SPARK-32632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liu Dinghua
>Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions generated by this method, because the rows of 
> the first partition aren't limited by the lowerBound and those of the last 
> partition aren't limited by the upperBound. 
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
> .selectExpr("id","appkey","funnel_name")
> data.show(100, false)  
> {code}
>  
> The resulting partition info is:
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses 
> of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` 
> >= 4
> The returned data is:
> ||id|| appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> Normally, the clause of the first partition should be "`id` >= 2 and `id` < 3" 
> because the lowerBound is 2, and the clause of the last partition should be 
> "`id` >= 4 and `id` < 5", but that is not what happens.
>  
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-20 Thread Liu Dinghua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181517#comment-17181517
 ] 

Liu Dinghua commented on SPARK-32632:
-

Thanks. When partitioning, what if we put the lowerBound in the WHERE clause 
of the first partition and the upperBound in that of the last partition? Would 
that cause any problems?
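For context, here is a minimal sketch of the stride-based scheme that produces the WHERE 
clauses shown in the description (illustrative code, not the actual JDBCRelation 
implementation; the method and variable names are invented). The bounds only determine the 
stride: the first and last partitions are deliberately left open-ended so that rows outside 
[lowerBound, upperBound) are not silently dropped.
{code:scala}
// Sketch only: reproduces the WHERE clauses reported by JDBCRelation above.
def partitionWhereClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  // The bounds only set the stride between partition edges; they are never used as row filters.
  val stride = upper / numPartitions - lower / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$column < $hi or $column is null"      // first partition: unbounded below, keeps NULLs
    else if (i == numPartitions - 1) s"$column >= $lo"   // last partition: unbounded above
    else s"$column >= $lo AND $column < $hi"
  }
}

// partitionWhereClauses("`id`", 2, 5, 3) gives:
//   `id` < 3 or `id` is null | `id` >= 3 AND `id` < 4 | `id` >= 4
{code}
Clamping the first and last clauses to the bounds would turn jdbc() from a partitioning hint 
into a row filter, which appears to be why they are left unbounded; rows outside the range can 
always be excluded with an explicit filter or a subquery passed as the table.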

> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> --
>
> Key: SPARK-32632
> URL: https://issues.apache.org/jira/browse/SPARK-32632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liu Dinghua
>Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions generated by this method, because the rows of 
> the first partition aren't limited by the lowerBound and those of the last 
> partition aren't limited by the upperBound. 
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
> .selectExpr("id","appkey","funnel_name")
> data.show(100, false)  
> {code}
>  
> The resulting partition info is:
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses 
> of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` 
> >= 4
> The returned data is:
> ||id|| appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> Normally, the clause of the first partition should be "`id` >= 2 and `id` < 3" 
> because the lowerBound is 2, and the clause of the last partition should be 
> "`id` >= 4 and `id` < 5", but that is not what happens.
>  
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Farhan Khan updated SPARK-32675:

Comment: was deleted

(was: Implementing PR: [https://github.com/apache/spark/pull/29499])

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> Submitting an application to a Mesos cluster in cluster mode through the REST 
> Submission API appends the --py-files option even though no value is passed 
> for it. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code:bash}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Farhan Khan updated SPARK-32675:

Description: 
Submitting an application to a Mesos cluster in cluster mode through the REST 
Submission API appends the --py-files option even though no value is passed for 
it. This causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}

Expected Dispatcher output would contain:

{code:bash}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}

  was:
Submitting an application to a Mesos cluster in cluster mode through the REST 
Submission API appends the --py-files option even though no value is passed for 
it. This causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}
Expected Dispatcher output would contain:
{code}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}


> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> Submitting an application to a Mesos cluster in cluster mode through the REST 
> Submission API appends the --py-files option even though no value is passed 
> for it. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would

[jira] [Commented] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181513#comment-17181513
 ] 

Farhan Khan commented on SPARK-32675:
-

Implementing PR: [https://github.com/apache/spark/pull/29499]

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> Submitting an application to a Mesos cluster in cluster mode through the REST 
> Submission API appends the --py-files option even though no value is passed 
> for it. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181514#comment-17181514
 ] 

Apache Spark commented on SPARK-32675:
--

User 'farhan5900' has created a pull request for this issue:
https://github.com/apache/spark/pull/29499

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> Submitting an application to a Mesos cluster in cluster mode through the REST 
> Submission API appends the --py-files option even though no value is passed 
> for it. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32675:


Assignee: (was: Apache Spark)

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Priority: Major
>
> Submitting an application to a Mesos cluster in cluster mode through the REST 
> Submission API appends the --py-files option even though no value is passed 
> for it. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32675:


Assignee: Apache Spark

> --py-files option is appended without passing value for it
> --
>
> Key: SPARK-32675
> URL: https://issues.apache.org/jira/browse/SPARK-32675
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Farhan Khan
>Assignee: Apache Spark
>Priority: Major
>
> Submitting an application to a Mesos cluster in cluster mode through the REST 
> Submission API appends the --py-files option even though no value is passed 
> for it. This causes a simple Java-based SparkPi job to fail.
> This bug was introduced by SPARK-26466.
> Here is the example job submission:
> {code:bash}
> curl -X POST http://localhost:7077/v1/submissions/create --header 
> "Content-Type:application/json" --data '{
> "action": "CreateSubmissionRequest",
> "appResource": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
> "clientSparkVersion": "3.0.0",
> "appArgs": ["30"],
> "environmentVariables": {},
> "mainClass": "org.apache.spark.examples.SparkPi",
> "sparkProperties": {
>   "spark.jars": 
> "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
>   "spark.driver.supervise": "false",
>   "spark.executor.memory": "512m",
>   "spark.driver.memory": "512m",
>   "spark.submit.deployMode": "cluster",
>   "spark.app.name": "SparkPi",
>   "spark.master": "mesos://localhost:5050"
> }}'
> {code}
> Expected Dispatcher output would contain:
> {code}
> 20/08/20 20:19:57 WARN DependencyUtils: Local jar 
> /var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
>  does not exist, skipping.
> Error: Failed to load class org.apache.spark.examples.SparkPi.
> 20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32675) --py-files option is appended without passing value for it

2020-08-20 Thread Farhan Khan (Jira)
Farhan Khan created SPARK-32675:
---

 Summary: --py-files option is appended without passing value for it
 Key: SPARK-32675
 URL: https://issues.apache.org/jira/browse/SPARK-32675
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 3.0.0
Reporter: Farhan Khan


Submitting an application to a Mesos cluster in cluster mode through the REST 
Submission API appends the --py-files option even though no value is passed for 
it. This causes a simple Java-based SparkPi job to fail.

This bug was introduced by SPARK-26466.

Here is the example job submission:
{code:bash}
curl -X POST http://localhost:7077/v1/submissions/create --header 
"Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
  "spark.jars": 
"file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
  "spark.driver.supervise": "false",
  "spark.executor.memory": "512m",
  "spark.driver.memory": "512m",
  "spark.submit.deployMode": "cluster",
  "spark.app.name": "SparkPi",
  "spark.master": "mesos://localhost:5050"
}}'
{code}
Expected Dispatcher output would contain:
{code}
20/08/20 20:19:57 WARN DependencyUtils: Local jar 
/var/lib/mesos/slaves/e6779377-08ec-4765-9bfc-d27082fbcfa1-S0/frameworks/e6779377-08ec-4765-9bfc-d27082fbcfa1-/executors/driver-20200820201954-0002/runs/d9d734e8-a299-4d87-8f33-b134c65c422b/spark.driver.memory=512m
 does not exist, skipping.
Error: Failed to load class org.apache.spark.examples.SparkPi.
20/08/20 20:19:57 INFO ShutdownHookManager: Shutdown hook called
{code}
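
To make the failure mode concrete, here is a hedged, self-contained illustration (this is 
not the dispatcher's actual code; the helper name and argument list are invented). When 
--py-files is emitted with no value, an option parser consumes the next token as its value 
and everything after it shifts by one position, which is consistent with a configuration 
value ending up where the application jar path was expected in the log above.
{code:scala}
// Illustrative only: why appending "--py-files" without a value corrupts the argument list.
def buildDriverArgs(pyFiles: Option[String]): Seq[String] = {
  // Buggy variant: the flag is always appended, even when there is nothing to pass with it.
  val pyFilesArgs = Seq("--py-files") ++ pyFiles.toSeq
  pyFilesArgs ++ Seq(
    "--conf", "spark.driver.memory=512m",
    "--class", "org.apache.spark.examples.SparkPi")
}

buildDriverArgs(None)
// => --py-files --conf spark.driver.memory=512m --class org.apache.spark.examples.SparkPi
//    "--conf" is swallowed as the value of --py-files, so the remaining tokens are
//    interpreted one position off. Appending the flag only when a value exists avoids this.
{code}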



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181504#comment-17181504
 ] 

Takeshi Yamamuro commented on SPARK-32673:
--

Could you please show us an example query to reproduce this?

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> (1) Spark Jobs
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 
> 10.139.64.5, executor 0): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/worker.py", line 644, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "/databricks/spark/python/pyspark/worker.py", line 463, in read_udfs
>     udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
>   File "/databricks/spark/python/pyspark/worker.py", line 254, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)
>   File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command
>     command = serializer._read_with_length(file)
>   File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
>     raise SerializationError("Caused by " + traceback.format_exc())
> pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32673:
-
Flags:   (was: Important)

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> (1) Spark Jobs
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 
> 10.139.64.5, executor 0): org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/worker.py", line 644, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>   File "/databricks/spark/python/pyspark/worker.py", line 463, in read_udfs
>     udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
>   File "/databricks/spark/python/pyspark/worker.py", line 254, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)
>   File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command
>     command = serializer._read_with_length(file)
>   File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
>     raise SerializationError("Caused by " + traceback.format_exc())
> pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
>   File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
>     return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
>     return pickle.loads(obj, encoding=encoding)
>   File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'wfdb'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Affects Version/s: 3.1.0

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32674:


Assignee: Apache Spark

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> Sometimes directory listing can become a bottleneck when user jobs have a 
> large number of input directories. This is especially true when running 
> against an object store like S3. 
> There are a few parameters to tune this. This issue proposes adding some info 
> to the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32674:


Assignee: (was: Apache Spark)

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Sometimes directory listing can become a bottleneck when user jobs have a 
> large number of input directories. This is especially true when running 
> against an object store like S3. 
> There are a few parameters to tune this. This issue proposes adding some info 
> to the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181493#comment-17181493
 ] 

Apache Spark commented on SPARK-32674:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29498

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Sometimes directory listing can become a bottleneck when user jobs have a 
> large number of input directories. This is especially true when running 
> against an object store like S3. 
> There are a few parameters to tune this. This issue proposes adding some info 
> to the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32646) ORC predicate pushdown should work with case-insensitive analysis

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181492#comment-17181492
 ] 

Apache Spark commented on SPARK-32646:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29498

> ORC predicate pushdown should work with case-insensitive analysis
> -
>
> Key: SPARK-32646
> URL: https://issues.apache.org/jira/browse/SPARK-32646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently ORC predicate pushdown doesn't work with case-insensitive analysis; 
> see SPARK-32622 for the test case.
> We should make ORC predicate pushdown work with case-insensitive analysis too.
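
A hedged repro sketch (not the actual test from SPARK-32622; the output path is made up): 
write an ORC file whose physical column name differs only in case from the name used in the 
filter, then check whether the predicate appears under PushedFilters in the physical plan.
{code:scala}
// With case-insensitive analysis, a filter on `a` should still be pushed down
// to an ORC file whose column is physically named `A`.
spark.conf.set("spark.sql.caseSensitive", "false")

spark.range(100).selectExpr("id AS A")
  .write.mode("overwrite").orc("/tmp/orc_case_test")   // hypothetical path

val df = spark.read.orc("/tmp/orc_case_test").filter("a < 10")
df.explain(true)   // inspect whether PushedFilters contains the LessThan predicate
{code}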



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-20 Thread Chao Sun (Jira)
Chao Sun created SPARK-32674:


 Summary: Add suggestion for parallel directory listing in tuning 
doc
 Key: SPARK-32674
 URL: https://issues.apache.org/jira/browse/SPARK-32674
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Chao Sun


Sometimes directory listing can become a bottleneck when user jobs have a large 
number of input directories. This is especially true when running against an 
object store like S3. 

There are a few parameters to tune this. This issue proposes adding some info to 
the tuning guide so that the knowledge can be better shared. 
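
As a concrete starting point, a minimal sketch of the two SQL configs that usually matter 
for parallel listing of file-based sources (the config keys come from Spark's SQLConf; the 
values shown are only illustrative, not recommendations from this ticket):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-listing-sketch")
  // Switch to a distributed, parallel file listing once at least this many paths are involved.
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // Upper bound on the parallelism used by that distributed listing job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
  .getOrCreate()

// Heavily partitioned input on an object store (e.g. S3) is where parallel
// listing pays off; the path below is hypothetical.
val events = spark.read.parquet("s3a://some-bucket/events/")
{code}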



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181486#comment-17181486
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I added some debugging to the compression code and it looks like in the 8th 
CompressedBatch of 10,000 entries the number of nulls seen was different from 
the number expected.

619 expected and 618 seen.  I'll try to debug this a bit more tomorrow.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181478#comment-17181478
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I did a little debugging and found that `BooleanBitSet$Encoder` is being used 
for compression.  There are other data orderings that use the same encoder and 
produce correct results though.
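
For anyone hit by this before a fix lands, a minimal workaround sketch based on the 
observation above that the corruption disappears with compression off (the config key is 
spark.sql.inMemoryColumnarStorage.compressed; the rest is just the repro from the description):
{code:scala}
// Disable in-memory columnar compression for cached data, then re-run the repro.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order.groupBy("b").count.show()   // counts before caching
bad_order.cache()
bad_order.groupBy("b").count.show()   // with compression off, the counts should match
{code}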

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181468#comment-17181468
 ] 

Thomas Graves commented on SPARK-32672:
---

[~cloud_fan] [~ruifengz]

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32672:
--
Affects Version/s: 3.0.1

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-32672:
--
Labels: correctness  (was: )

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181466#comment-17181466
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I verified that this is still happening on 3.1.0-SNAPSHOT too.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32640:
-
Labels: correctness  (was: )

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10
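
For reference, a hedged sketch of one way to build equivalent input in the shell (the ticket 
does not show how df was created; Float.MinValue and Float.MaxValue match the extreme values 
in the tables):
{code:scala}
// Single float column covering the same cases as the tables above.
import spark.implicits._

val df = Seq(Float.MinValue, Float.MaxValue, 0.0f, -0.0f, 1.0f, -1.0f, Float.NaN).toDF("value")
df.selectExpr("value", "log1p(value)").show()
// Spark 3.0.0 returns NaN for the NaN row; the 3.1.0 snapshot described here returns null.
{code}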



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32640:
-
Labels: correction  (was: )

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correction
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32640:
-
Labels:   (was: correction)

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181459#comment-17181459
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

I verified that this is still happening on 3.0.2-SNAPSHOT

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data.  If I disable compression in the cache the issue goes
> away.  I was able to make this happen in 3.0.0.  I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}
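[Editor's note] Since the reporter says the corruption disappears when cache compression is off, here is a short, hedged way to check that locally. It assumes a spark-shell session and uses the in-memory columnar compression switch `spark.sql.inMemoryColumnarStorage.compressed`; the snippet is a sketch, not part of the original report:

{code}
// Sketch only: re-run the counts with in-memory columnar compression disabled,
// which the reporter says makes the corruption go away.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order.unpersist()                  // drop any previously cached (compressed) copy
bad_order.cache()
bad_order.groupBy("b").count.show()    // expected to match the uncached counts
{code}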



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Affects Version/s: 2.4.6

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data.  If I disable compression in the cache the issue goes
> away.  I was able to make this happen in 3.0.0.  I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-20 Thread Sandy Su (Jira)
Sandy Su created SPARK-32673:


 Summary: Pyspark/cloudpickle.py - no module named 'wfdb'
 Key: SPARK-32673
 URL: https://issues.apache.org/jira/browse/SPARK-32673
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Sandy Su


Running Spark in a Databricks notebook.

 

Ran into this issue when executing a cell:



(1) Spark Jobs

SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4
times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 10.139.64.5,
executor 0): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'wfdb'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 644, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/databricks/spark/python/pyspark/worker.py", line 463, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/databricks/spark/python/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'wfdb'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Summary: Data corruption in some cached compressed boolean columns  (was: 
Daat corruption in some cached compressed boolean columns)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data.  If I disable compression in the cache the issue goes
> away.  I was able to make this happen in 3.0.0.  I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Attachment: bad_order.snappy.parquet

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data.  If I disable compression in the cache the issue goes
> away.  I was able to make this happen in 3.0.0.  I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32672) Daat corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-32672:
---

 Summary: Daat corruption in some cached compressed boolean columns
 Key: SPARK-32672
 URL: https://issues.apache.org/jira/browse/SPARK-32672
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Robert Joseph Evans
 Attachments: bad_order.snappy.parquet

I found that when sorting some boolean data into the cache, the results can
change when the data is read back out.

It needs to be a non-trivial amount of data, and it is highly dependent on the
order of the data.  If I disable compression in the cache the issue goes away.
I was able to make this happen in 3.0.0.  I am going to try to reproduce it in
other versions too.

I'll attach the parquet file with boolean data in an order that causes this to
happen. As you can see, after the data is cached a single null value switches
over to false.

{code}
scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
bad_order: org.apache.spark.sql.DataFrame = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null| 7153|
| true|54334|
|false|54021|
+-----+-----+


scala> bad_order.cache()
res1: bad_order.type = [b: boolean]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null| 7152|
| true|54334|
|false|54022|
+-----+-----+


scala> 

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181447#comment-17181447
 ] 

Apache Spark commented on SPARK-32670:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29497

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize error messages and ease their maintenance, we can try to
> group the exception messages into a single file.
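[Editor's note] As a rough illustration of what grouping messages in one file could look like, here is a purely hypothetical Scala sketch; the object and method names are invented for illustration and are not Spark's actual classes:

{code}
// Hypothetical sketch, not Spark code: keep every analyzer error message behind
// a named factory method in a single object so wording stays consistent and is
// easy to audit and translate.
object AnalyzerErrors {
  final case class AnalyzerError(message: String) extends Exception(message)

  def unresolvedColumn(name: String, candidates: Seq[String]): AnalyzerError =
    AnalyzerError(s"cannot resolve column '$name'; candidates: ${candidates.mkString(", ")}")

  def ambiguousReference(name: String, count: Int): AnalyzerError =
    AnalyzerError(s"reference '$name' is ambiguous; it matches $count columns")
}
{code}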



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32670:


Assignee: Xiao Li  (was: Apache Spark)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize error messages and ease their maintenance, we can try to
> group the exception messages into a single file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181448#comment-17181448
 ] 

Apache Spark commented on SPARK-32670:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29497

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize error messages and ease their maintenance, we can try to
> group the exception messages into a single file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32670:


Assignee: Apache Spark  (was: Xiao Li)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>
> To standardize error messages and ease their maintenance, we can try to
> group the exception messages into a single file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31214.
-

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31214.
---
Resolution: Invalid

I am closing this issue as `Invalid` because the change was reverted due to the
correctness issue reported in SPARK-32640.

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-31214:
---
  Assignee: (was: Jungtaek Lim)

This is reverted via SPARK-32640

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31214) Upgrade Janino to 3.1.2

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31214:
--
Fix Version/s: (was: 3.1.0)

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31101:
--
Fix Version/s: 2.4.6
   3.0.0

> Upgrade Janino to 3.0.16
> 
>
> Key: SPARK-31101
> URL: https://issues.apache.org/jira/browse/SPARK-31101
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> We got reports of failures on users' queries where Janino throws an error while
> compiling the generated code. The issue is here: janino-compiler/janino#113. It
> contains the generated code, the symptom (error), and an analysis of the bug,
> so please refer to the link for more details.
> Janino 3.0.16 contains the PR janino-compiler/janino#114, which enables Janino
> to compile such queries properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31101:
--
Fix Version/s: (was: 2.4.6)
   (was: 3.0.0)

> Upgrade Janino to 3.0.16
> 
>
> Key: SPARK-31101
> URL: https://issues.apache.org/jira/browse/SPARK-31101
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> We got reports of failures on users' queries where Janino throws an error while
> compiling the generated code. The issue is here: janino-compiler/janino#113. It
> contains the generated code, the symptom (error), and an analysis of the bug,
> so please refer to the link for more details.
> Janino 3.0.16 contains the PR janino-compiler/janino#114, which enables Janino
> to compile such queries properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32640:
-

Assignee: Wenchen Fan

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32640.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29495
[https://github.com/apache/spark/pull/29495]

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved SPARK-32671.

Resolution: Invalid

I was mistaken about this issue.

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.
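[Editor's note] A tiny, self-contained sketch of the kind of fix described above. This is hypothetical illustration code, not the actual MapOutputTracker change: per-partition sizes are accumulated with atomic adds so concurrent updates cannot be lost and the totals are deterministic.

{code}
import java.util.concurrent.atomic.AtomicLongArray
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical stand-in for map outputs: 8 statuses, each reporting 1 byte
// for every one of 4 partitions.
val numPartitions = 4
val mapOutputs = Seq.fill(8)(Array.fill(numPartitions)(1L))

// addAndGet is atomic, so parallel updates to the same index are never lost.
val totals = new AtomicLongArray(numPartitions)
val updates = mapOutputs.map { sizes =>
  Future {
    var p = 0
    while (p < numPartitions) { totals.addAndGet(p, sizes(p)); p += 1 }
  }
}
Await.result(Future.sequence(updates), Duration.Inf)
println((0 until numPartitions).map(totals.get))  // Vector(8, 8, 8, 8)
{code}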



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32660) Show Avro related API in documentation

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181404#comment-17181404
 ] 

Rohit Mishra commented on SPARK-32660:
--

[~Gengliang.Wang], Can you please add a description?

> Show Avro related API in documentation
> --
>
> Key: SPARK-32660
> URL: https://issues.apache.org/jira/browse/SPARK-32660
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32668) HiveGenericUDTF initialize UDTF should use StructObjectInspector method

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181403#comment-17181403
 ] 

Rohit Mishra commented on SPARK-32668:
--

[~ulysses], Can you please add a description?

> HiveGenericUDTF initialize UDTF should use StructObjectInspector method
> ---
>
> Key: SPARK-32668
> URL: https://issues.apache.org/jira/browse/SPARK-32668
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32669) test expression nullability when checking result

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181401#comment-17181401
 ] 

Rohit Mishra commented on SPARK-32669:
--

[~cloud_fan], Can you please add a description?

> test expression nullability when checking result
> 
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181400#comment-17181400
 ] 

Rohit Mishra commented on SPARK-32670:
--

[~smilegator], Can you kindly add more to the description? It will be helpful 
for others. 

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> To standardize error messages and ease their maintenance, we can try to
> group the exception messages into a single file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-20 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181393#comment-17181393
 ] 

Rohit Mishra commented on SPARK-32667:
--

[~angerszhuuu], Can you please add a description?

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181363#comment-17181363
 ] 

Apache Spark commented on SPARK-24266:
--

User 'jkleckner' has created a pull request for this issue:
https://github.com/apache/spark/pull/29496

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> downloa

[jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181362#comment-17181362
 ] 

Apache Spark commented on SPARK-24266:
--

User 'jkleckner' has created a pull request for this issue:
https://github.com/apache/spark/pull/29496

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> downloa

[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes

2020-08-20 Thread James Boylan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181349#comment-17181349
 ] 

James Boylan commented on SPARK-31800:
--

This issue definitely seems to persist. I would note that this testing was done 
on Hadoop 3.2 pre-built Spark 3.0 version. I have not done testing on the 
Hadoop 2.7 pre-built Spark 3.0 version.

[~jagadeesh.n] - Were you testing on the same version?

> Unable to disable Kerberos when submitting jobs to Kubernetes
> -
>
> Key: SPARK-31800
> URL: https://issues.apache.org/jira/browse/SPARK-31800
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: James Boylan
>Priority: Major
>
> When you attempt to submit a process to Kubernetes using spark-submit through 
> --master, it returns the exception:
> {code:java}
> 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> Exception in thread "main" org.apache.spark.SparkException: Please specify 
> spark.kubernetes.file.upload.path property.
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246)
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> at scala.collection.immutable.List.foldLeft(List.scala:89)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called
> 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/p1/y24myg413wx1l1l52bsdn2hrgq/T/spark-c94db9c5-b8a8-414d-b01d-f6369d31c9b8
>  {code}
> No changes in settings appear to be able to disable Kerberos. This is when 
> running a simple execution of the SparkPi on our lab cluster. The command 
> being used is
> {code:java}
> ./bin/spark-submit --master k8s://https://{api_hostname} --deploy

[jira] [Updated] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32621:

Fix Version/s: 3.0.1

> "path" option is added again to input paths during infer()
> --
>
> Key: SPARK-32621
> URL: https://issues.apache.org/jira/browse/SPARK-32621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> When "path" option is used when creating a DataFrame, it can cause issues 
> during infer.
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
> // The following with "path" option fails with the following:
> // assertion failed: Conflicting directory structures detected. Suspicious 
> paths
> //file:/tmp
> //file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", 
> path).load.count() === 2)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181343#comment-17181343
 ] 

Apache Spark commented on SPARK-32640:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29495

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32640:


Assignee: (was: Apache Spark)

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181342#comment-17181342
 ] 

Apache Spark commented on SPARK-32640:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29495

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32640) Spark 3.1 log(NaN) returns null instead of NaN

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32640:


Assignee: Apache Spark

> Spark 3.1 log(NaN) returns null instead of NaN
> --
>
> Key: SPARK-32640
> URL: https://issues.apache.org/jira/browse/SPARK-32640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> I was testing Spark 3.1.0 and I noticed that if you take the log(NaN) it now 
> returns a null whereas in Spark 3.0 it returned a NaN.  I'm not an expert in 
> this but I thought NaN was correct.
> Spark 3.1.0 Example:
> >>> df.selectExpr(["value", "log1p(value)"]).show()
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|              null|
> +-------------+------------------+
>  
> Spark 3.0.0 example:
>  
> +-------------+------------------+
> |        value|      LOG1P(value)|
> +-------------+------------------+
> |-3.4028235E38|              null|
> | 3.4028235E38| 88.72283906194683|
> |          0.0|               0.0|
> |         -0.0|              -0.0|
> |          1.0|0.6931471805599453|
> |         -1.0|              null|
> |          NaN|               NaN|
> +-------------+------------------+
>  
> Note it also does the same for log1p, log2, log10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32671:


Assignee: Apache Spark

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Assignee: Apache Spark
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181338#comment-17181338
 ] 

Apache Spark commented on SPARK-32671:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/29494

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32671:


Assignee: (was: Apache Spark)

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32671) Race condition in MapOutputTracker.getStatistics

2020-08-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181337#comment-17181337
 ] 

Apache Spark commented on SPARK-32671:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/29494

> Race condition in MapOutputTracker.getStatistics
> 
>
> Key: SPARK-32671
> URL: https://issues.apache.org/jira/browse/SPARK-32671
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Andy Grove
>Priority: Major
>
> MapOutputTracker.getStatistics builds an array of partition sizes for a 
> shuffle id and in some cases uses multiple threads running in parallel to 
> update this array. This code is not thread-safe and the output is 
> non-deterministic when there are multiple MapStatus entries for the same 
> partition.
> We have unit tests such as the skewed join tests in AdaptiveQueryExecSuite 
> that depend on the output being deterministic, and intermittent failures in 
> these tests led me to track this bug down.
> The issue is trivial to fix by using an AtomicLong when building the array of 
> partition sizes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


