[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2020-01-28 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025655#comment-17025655
 ] 

Shixiong Zhu commented on SPARK-28556:
--

This is not deprecating the API. We just fixed an issue in an experimental API. 
Hence, I don't think we need any deprecation note.

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.
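
For reference, a minimal sketch of wiring up a listener, assuming the 2.4-era signature {{onFailure(funcName: String, qe: QueryExecution, exception: Exception)}}; the point of this ticket is that a java.lang.Error thrown while running the query never reaches this callback, only Exceptions do:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().appName("listener-demo").getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = ()
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
    // Before the fix, this is only invoked for Exceptions, not Errors such as OutOfMemoryError.
    println(s"Query $funcName failed: ${exception.getMessage}")
  }
})
{code}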



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025651#comment-17025651
 ] 

Maxim Gekk commented on SPARK-30668:


> This is not mentioned in the migration guide.

It is mentioned:
{code}
- The `unix_timestamp`, `date_format`, `to_unix_timestamp`, 
`from_unixtime`, `to_date`, `to_timestamp` functions. New implementation 
supports pattern formats as described here 
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html
 and performs strict checking of its input. For example, the `2015-07-22 
10:00:00` timestamp cannot be parsed if the pattern is `yyyy-MM-dd` because the 
parser does not consume the whole input. Another example is the `31/01/2015 00:00` 
input cannot be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` supposes 
hours in the range `1-12`.
{code}
 
> Do we have a simple way to remove such a behavior change? 

The change is related to the migration to Proleptic Gregorian calendar. To 
remove the behavior, you need to revert most of 
https://issues.apache.org/jira/browse/SPARK-26651 and maybe more.

> For example, converting the pattern for users?

Even if it is possible to convert patterns, the results can differ for old 
dates due to the calendar system.

> Can we let users choose different parsing mechanisms between SimpleDateFormat 
> and DateTimeFormat?

No, that flag was removed a year ago, see 
https://issues.apache.org/jira/browse/SPARK-26503 and see 
https://github.com/apache/spark/pull/23391#discussion_r244414750

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025644#comment-17025644
 ] 

Xiao Li commented on SPARK-30668:
-

Can we let users choose different parsing mechanisms between SimpleDateFormat 
and DateTimeFormat? 

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025643#comment-17025643
 ] 

Xiao Li commented on SPARK-30668:
-

This will make the migration very painful. This is not mentioned in the 
migration guide. It will also generate different query results. Do we have a 
simple way to remove such a behavior change? For example, converting the 
pattern for users?

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21823) ALTER TABLE table statements such as RENAME and CHANGE columns should raise error if there are any dependent constraints.

2020-01-28 Thread sakshi chourasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025611#comment-17025611
 ] 

sakshi chourasia edited comment on SPARK-21823 at 1/29/20 6:52 AM:
---

Hi [~ksunitha]

I work with Spark SQL and need this command to alter table names. I have 
100+ tables in our Prod and Dev environments. May I know when this command is 
planned to be released?


was (Author: sakshi49):
I work with Spark SQL and need this command to alter table names. I have 
100+ tables in our Prod and Dev environments. May I know when this command is 
planned to be released?

> ALTER TABLE table statements  such as RENAME and CHANGE columns should  raise 
>  error if there are any dependent constraints. 
> -
>
> Key: SPARK-21823
> URL: https://issues.apache.org/jira/browse/SPARK-21823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Following ALTER TABLE DDL statements will impact  the  informational 
> constraints defined on a table:
> {code:sql}
> ALTER TABLE name RENAME TO new_name
> ALTER TABLE name CHANGE column_name new_name new_type
> {code}
> Spark SQL should raise errors if there are informational constraints 
> defined on the columns affected by the ALTER, and let the user drop the 
> constraints before proceeding with the DDL. In the future we can enhance 
> ALTER to automatically fix up the constraint definition in the catalog when 
> possible, and not raise an error.
> When Spark adds support for DROP/REPLACE of columns, they will also impact 
> informational constraints.
> {code:sql}
> ALTER TABLE name DROP [COLUMN] column_name
> ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2020-01-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025639#comment-17025639
 ] 

Dongjoon Hyun commented on SPARK-28556:
---

[~smilegator]. How should we handle this? Do you want to add a deprecation 
note at 2.4.5?
2.4.5 is the last version before 3.0.0.

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3847) Enum.hashCode is only consistent within the same JVM

2020-01-28 Thread Kaspar Fischer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025631#comment-17025631
 ] 

Kaspar Fischer commented on SPARK-3847:
---

This issue is still present in Spark 2.4.0. The PR mentioned above didn’t 
actually result in a code change that got committed. 

> Enum.hashCode is only consistent within the same JVM
> 
>
> Key: SPARK-3847
> URL: https://issues.apache.org/jira/browse/SPARK-3847
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Oracle JDK 7u51 64bit on Ubuntu 12.04
>Reporter: Nathan Bijnens
>Priority: Major
>  Labels: bulk-closed, enum
>
> When using Java Enums as keys in some operations, the results will be very 
> unexpected. The issue is that the Java Enum.hashCode returns the 
> memory position, which is different on each JVM. 
> {code}
> messages.filter(_.getHeader.getKind == Kind.EVENT).count
> >> 503650
> val tmp = messages.filter(_.getHeader.getKind == Kind.EVENT)
> tmp.map(_.getHeader.getKind).countByValue
> >> Map(EVENT -> 1389)
> {code}
> Because it's actually a JVM issue, we should either reject enums as keys with 
> an error or implement a workaround (see the sketch below).
> A good writeup of the issue can be found here (and a workaround):
> http://dev.bizo.com/2014/02/beware-enums-in-spark.html
> Somewhat more on the hash codes and Enum's:
> https://stackoverflow.com/questions/4885095/what-is-the-reason-behind-enum-hashcode
> And some issues (most of them rejected) at the Oracle Bug Java database:
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8050217
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7190798
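
A minimal sketch of the workaround from the linked writeup, reusing the names from the snippet above ({{messages}}, {{getHeader}}, {{getKind}}): key by a stable value such as the enum's name instead of the enum object itself.
{code:scala}
// Enum.hashCode is identity-based and differs across JVMs, but Enum.name()
// (or ordinal()) is stable, so use it as the shuffle key instead.
val countsByKind = messages
  .map(_.getHeader.getKind.name())   // stable String key instead of the enum value
  .countByValue()
{code}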



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21823) ALTER TABLE table statements such as RENAME and CHANGE columns should raise error if there are any dependent constraints.

2020-01-28 Thread sakshi chourasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025611#comment-17025611
 ] 

sakshi chourasia commented on SPARK-21823:
--

I work with Spark SQL and need this command to alter table names. I have 
100+ tables in our Prod and Dev environments. May I know when this command is 
planned to be released?

> ALTER TABLE table statements  such as RENAME and CHANGE columns should  raise 
>  error if there are any dependent constraints. 
> -
>
> Key: SPARK-21823
> URL: https://issues.apache.org/jira/browse/SPARK-21823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Following ALTER TABLE DDL statements will impact  the  informational 
> constraints defined on a table:
> {code:sql}
> ALTER TABLE name RENAME TO new_name
> ALTER TABLE name CHANGE column_name new_name new_type
> {code}
> Spark SQL should raise errors if there are informational constraints 
> defined on the columns affected by the ALTER, and let the user drop the 
> constraints before proceeding with the DDL. In the future we can enhance 
> ALTER to automatically fix up the constraint definition in the catalog when 
> possible, and not raise an error.
> When Spark adds support for DROP/REPLACE of columns, they will also impact 
> informational constraints.
> {code:sql}
> ALTER TABLE name DROP [COLUMN] column_name
> ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-01-28 Thread Abhishek Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rao updated SPARK-30619:
-
Issue Type: Bug  (was: Question)

> org.slf4j.Logger and org.apache.commons.collections classes not built as part 
> of hadoop-provided profile
> 
>
> Key: SPARK-30619
> URL: https://issues.apache.org/jira/browse/SPARK-30619
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.2, 2.4.4
> Environment: Spark on kubernetes
>Reporter: Abhishek Rao
>Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count 
> (org.apache.spark.examples.JavaWordCount) example on local files.
> But we're seeing that it expects the org.slf4j.Logger and 
> org.apache.commons.collections classes to be available to run this.
> We expected the binary to work as-is for local files. Is there anything 
> that we're missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025574#comment-17025574
 ] 

Dongjoon Hyun commented on SPARK-30658:
---

Hi, [~tdas]. Can we have `2.4.5` at `Target Version`, too?

> Limit after on streaming dataframe before streaming agg returns wrong results
> -
>
> Key: SPARK-30658
> URL: https://issues.apache.org/jira/browse/SPARK-30658
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Priority: Critical
>
> Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) 
> in complete mode was not being planned as a streaming limit. The planner rule 
> planned a logical limit with a stateful streaming limit plan only if the 
> query is in append mode. As a result, instead of allowing at most 5 rows across 
> batches, the planned streaming query was allowing 5 rows in every batch, thus 
> producing incorrect results.
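
For illustration, a hedged sketch of a query with the shape described above; the rate source and console sink are assumptions, not taken from the report.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("limit-before-agg").getOrCreate()

// Streaming limit before an aggregation in complete mode: the intent is at most
// 5 rows total across all micro-batches, not 5 rows per micro-batch.
val df = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val query = df.limit(5).groupBy().count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
{code}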



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

2020-01-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025575#comment-17025575
 ] 

Dongjoon Hyun commented on SPARK-30657:
---

Hi, [~tdas]. Can we have `2.4.5` at `Target Version`, too?

> Streaming limit after streaming dropDuplicates can throw error
> --
>
> Key: SPARK-30657
> URL: https://issues.apache.org/jira/browse/SPARK-30657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> {{LocalLimitExec}} does not consume the iterator of the child plan. So if 
> there is a limit after a stateful operator like streaming dedup in append 
> mode (e.g. {{streamingdf.dropDuplicates().limit(5)}}), the state changes of 
> the streaming dedup may not be committed (most stateful ops commit state 
> changes only after the generated iterator is fully consumed). This leads to 
> the next batch failing with {{java.lang.IllegalStateException: Error reading 
> delta file .../N.delta does not exist}} as the state store delta file was 
> never generated.
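
For illustration, a hedged sketch of a query with this shape; the rate source and console sink are assumptions.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedup-then-limit").getOrCreate()
val streamingdf = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// dropDuplicates keeps dedup state per key; if the downstream limit stops consuming
// the iterator early, that state's delta file may never be written for the batch.
val query = streamingdf.dropDuplicates()
  .limit(5)
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
{code}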



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025549#comment-17025549
 ] 

Maxim Gekk commented on SPARK-30668:


Date/timestamp parsing is based on Java 8 DateTimeFormat in Spark 3.0 which may 
have different notion of pattern letters (see 
[https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]):
{code}
   V   time-zone ID              zone-id    America/Los_Angeles; Z; -08:30
   z   time-zone name            zone-name  Pacific Standard Time; PST
   O   localized zone-offset     offset-O   GMT+8; GMT+08:00; UTC-08:00;
   X   zone-offset 'Z' for zero  offset-X   Z; -08; -0830; -08:30; -083015; -08:30:15;
   x   zone-offset               offset-x   +0000; -08; -0830; -08:30; -083015; -08:30:15;
   Z   zone-offset               offset-Z   +0000; -0800; -08:00;
{code}
As you can see, 'z' is for the time-zone name, but you are trying to parse zone 
offsets. You can use 'x' or 'Z' in the pattern instead of 'z':
{code}
scala> spark.sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
"-MM-dd'T'HH:mm:ss.SSSZ")""").show(false)
+-----------------------------------------------------------------------------+
|to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ')  |
+-----------------------------------------------------------------------------+
|2020-01-28 07:06:11.847                                                       |
+-----------------------------------------------------------------------------+
{code}

Parsing in Spark 2.4 is based on SimpleDateFormat (see 
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) 
where 'z' has a slightly different meaning.

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-01-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30637:
-
Parent: (was: SPARK-23435)
Issue Type: Test  (was: Sub-task)

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Test
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> I will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23435) R tests should support latest testthat

2020-01-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23435:
-
Fix Version/s: 2.4.5

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming

2020-01-28 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30669:
---

 Summary: Introduce AdmissionControl API to Structured Streaming
 Key: SPARK-30669
 URL: https://issues.apache.org/jira/browse/SPARK-30669
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4
Reporter: Burak Yavuz


In Structured Streaming, we have the concept of Triggers. With a trigger like 
Trigger.Once(), the semantics are to process all the data available to the 
datasource in a single micro-batch. However, this semantic can be broken when 
data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate 
limit the amount of data read for that micro-batch.

We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. A 
ReadLimit defines how much data should be read in the next micro-batch. 
`SupportsAdmissionControl` specifies that a source can rate limit its ingest 
into the system. The source can tell the system what the user specified as a 
read limit, and the system can enforce this limit within each micro-batch or 
impose its own limit if, for example, the Trigger is Trigger.Once().
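
A rough sketch of what such interfaces could look like; everything beyond the names `SupportsAdmissionControl` and `ReadLimit` is illustrative, not the final API.
{code:scala}
// Illustrative only: a source that supports admission control reports the limit
// implied by its user-facing options and reads up to whatever limit the engine
// passes back for the next micro-batch.
trait ReadLimit
case class MaxRows(rows: Long) extends ReadLimit   // hypothetical concrete limit
case object ReadAllAvailable extends ReadLimit     // hypothetical: e.g. for Trigger.Once()

trait SupportsAdmissionControl {
  /** The limit implied by user-specified options such as maxOffsetsPerTrigger. */
  def getDefaultReadLimit: ReadLimit
  /** The latest offset this source may read up to under the given limit. */
  def latestOffset(startOffset: AnyRef, limit: ReadLimit): AnyRef
}
{code}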



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23435) R tests should support latest testthat

2020-01-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23435.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27359
[https://github.com/apache/spark/pull/27359]

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 3.0.0
>
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-01-28 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025522#comment-17025522
 ] 

Sunitha Kambhampati commented on SPARK-27298:
-

1) Would it be possible to run your program with explain(true) to see the query 
plan in both setups? For example: 

maleExcptIncomeMatch.explain(true);

2) Also, can you add this query to your application and share the output?

spark.sql("select count(*) from customer where Income is null and 
Gender='M'").show()

FWIW, I do not have the exact environment that you have, but I wanted to add that 
I tried to run your repro on my Mac with Spark 3.0 preview2, and the count shows 
up as 148237.

Interestingly, I observed that the difference you are seeing in the count 
actually matches the number of rows that have a null Income for Gender = 'M', 
which is 18705.

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); local //below path for VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> //Adding an extra step of saving to db and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset<Row> dataLoaded = spark.sql("select * from customer");
> //Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> Column incomeCol = dataLoaded.col("Income");
> Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) 
> valuesArray));
> System.out.println("The count of customers satisfaying Income is :"+ 
> incomeMatchingSet.count());
> System.out.println("*");
> Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
> System.out.println("The count of final customers is :"+ 
> maleExcptIncomeMatch.count());
> System.out.println("*");
> }
> }
> {code}
> When the above code is executed on Spark 2.3.0, it gives the following different 
> results:
> *Windows* :  The code gives correct count of dataset 148237,
> *Linux :*         The code gives different {color:#172b4d}count of dataset 
> 129532 {color}
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used(attached)
> 3. Windows spec 
>           Windows 10- 64 bit OS 
> 4. Linux spec (Running on Oracle VM virtual box)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Resolved] (SPARK-30481) Integrate event log compactor into Spark History Server

2020-01-28 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-30481.

Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> Integrate event log compactor into Spark History Server
> ---
>
> Key: SPARK-30481
> URL: https://issues.apache.org/jira/browse/SPARK-30481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue is to track the effort on compacting old event logs (and cleaning 
> up after compaction) without breaking guaranteeing of compatibility.
> This issue depends on SPARK-29779 and SPARK-30479, and focuses on integrating 
> event log compactor into Spark History Server and enable configurations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-01-28 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-30663:
---
Issue Type: Planned Work  (was: Bug)

> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Planned Work
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility 
> mode 
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-30668:
---

Assignee: (was: Xiao Li)

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-30668:
---

Assignee: Xiao Li

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025467#comment-17025467
 ] 

Xiao Li commented on SPARK-30668:
-

cc [~maxgekk]

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern

2020-01-28 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30668:

Summary: to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using 
pattern   (was: to_timestamp failed to parse 2020-01-27T20:06:11.847-0800)

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-28 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30668:

Summary: to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using 
pattern "-MM-dd'T'HH:mm:ss.SSSz"  (was: to_timestamp failed to parse 
2020-01-27T20:06:11.847-0800 using pattern )

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800

2020-01-28 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30668:

Description: 
{code:java}
SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
"-MM-dd'T'HH:mm:ss.SSSz")
{code}

This can return a valid value in Spark 2.4 but return NULL in the latest master


  was:

{code:java}
SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
"-MM-dd'T'HH:mm:ss.SSSz")
{code}

This can return a valid value by 2.4 but return NULL in the latest master



> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800
> -
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but return NULL in the latest 
> master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800

2020-01-28 Thread Xiao Li (Jira)
Xiao Li created SPARK-30668:
---

 Summary: to_timestamp failed to parse 2020-01-27T20:06:11.847-0800
 Key: SPARK-30668
 URL: https://issues.apache.org/jira/browse/SPARK-30668
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li



{code:java}
SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
"-MM-dd'T'HH:mm:ss.SSSz")
{code}

This can return a valid value by 2.4 but return NULL in the latest master




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30656) Support the "minPartitions" option in Kafka batch source and streaming source v1

2020-01-28 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-30656:
-
Issue Type: Improvement  (was: Bug)

> Support the "minPartitions" option in Kafka batch source and streaming source 
> v1
> 
>
> Key: SPARK-30656
> URL: https://issues.apache.org/jira/browse/SPARK-30656
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, the "minPartitions" option only works in Kafka streaming source 
> v2. It would be great that we can support it in batch and streaming source v1 
> (v1 is the fallback mode when a user hits a regression in v2) as well.
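
For reference, a hedged sketch of how the option is used with the v2 streaming source today (broker and topic names are placeholders); the ask is that the batch source and the v1 streaming source honor the same option.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-min-partitions").getOrCreate()

// "minPartitions" asks Spark to split Kafka topic-partitions into at least this many
// Spark partitions; currently it only takes effect in the Kafka streaming source v2.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("minPartitions", "64")
  .load()
{code}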



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30667) Support simple all gather in barrier task context

2020-01-28 Thread Sarth Frey (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025390#comment-17025390
 ] 

Sarth Frey commented on SPARK-30667:


I will work on this

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide all gather that can let tasks share additional information with 
> others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context

2020-01-28 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30667:
--
Description: 
Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
can see all IP addresses from BarrierTaskContext. It would be simpler to 
integrate with distributed frameworks like TensorFlow DistributionStrategy if 
we provide all gather that can let tasks share additional information with 
others, e.g., an available port.

Note that with all gather, tasks share their IP addresses as well.

{code}
port = ... # get an available port
ports = context.all_gather(port) # get all available ports, ordered by task ID
...  # set up distributed training service
{code}

  was:
Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
can see all IP addresses from BarrierTaskContext. It would be simpler to 
integrate with distributed frameworks like TensorFlow DistributionStrategy if 
we provide all gather that can let tasks share additional information with 
others, e.g., an available port.

Note that with all gather, tasks share their IP addresses as well.


> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide all gather that can let tasks share additional information with 
> others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30667) Support simple all gather in barrier task context

2020-01-28 Thread Xiangrui Meng (Jira)
Xiangrui Meng created SPARK-30667:
-

 Summary: Support simple all gather in barrier task context
 Key: SPARK-30667
 URL: https://issues.apache.org/jira/browse/SPARK-30667
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
can see all IP addresses from BarrierTaskContext. It would be simpler to 
integrate with distributed frameworks like TensorFlow DistributionStrategy if 
we provide all gather that can let tasks share additional information with 
others, e.g., an available port.

Note that with all gather, tasks share their IP addresses as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Component/s: (was: SQL)
 Spark Core

> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code incrementing accumulators 
> also produces identical accumulator increments on success. Rerunning 
> partitions for any reason should always produce the same increments on 
> success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy of how a new 
> increment updates over an earlier increment from the same partition, 
> different semantics of accumulators (here called accumulator modes) can be 
> implemented:
>  - ALL sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only 
> increment while a partition is processed, a successful task provides an 
> accumulator value that is always larger than any value of failed tasks, hence 
> it supersedes any failed task's value. This produces reliable accumulator 
> values. This should only be used in a single stage.
>  - LAST increment: allows to retrieve the latest increment for each partition 
> only.
> The implementation for MAX and LAST requires extra memory that scales with 
> the number of partitions. The current ALL implementation does not require 
> extra memory.
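
A minimal sketch of one possible shape of the MAX mode on top of the existing AccumulatorV2 API (an assumption for illustration, not the proposal's implementation): increments are summed within a task attempt and merged per partition by taking the maximum.
{code:scala}
import scala.collection.mutable
import org.apache.spark.TaskContext
import org.apache.spark.util.AccumulatorV2

// Within one task attempt, add() sums increments for that partition; when task
// results are merged on the driver, the larger per-partition total wins, so a
// rerun of a partition does not inflate the final value.
class MaxModeLongAccumulator extends AccumulatorV2[Long, Long] {
  private val sums = mutable.Map.empty[Int, Long]

  override def isZero: Boolean = sums.isEmpty
  override def copy(): MaxModeLongAccumulator = {
    val acc = new MaxModeLongAccumulator
    acc.sums ++= sums
    acc
  }
  override def reset(): Unit = sums.clear()
  override def add(v: Long): Unit = {
    val pid = TaskContext.getPartitionId()
    sums(pid) = sums.getOrElse(pid, 0L) + v              // ALL-style sum inside one attempt
  }
  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: MaxModeLongAccumulator =>
      o.sums.foreach { case (pid, v) =>
        sums(pid) = math.max(sums.getOrElse(pid, v), v)  // MAX across attempts per partition
      }
    case _ => throw new UnsupportedOperationException("incompatible accumulator")
  }
  override def value: Long = sums.values.sum             // total over per-partition maxima
}
{code}
Such an accumulator would be registered like any other AccumulatorV2, e.g. {{sc.register(new MaxModeLongAccumulator, "reliableCounter")}}.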



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code incrementing accumulators 
also produces identical accumulator increments on success. Rerunning partitions 
for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy of how a new 
increment updates over an earlier increment from the same partition, different 
semantics of accumulators (here called accumulator modes) can be implemented:
 - ALL sums over all increments of each partition: this represents the current 
implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only 
increment while a partition is processed, a successful task provides an 
accumulator value that is always larger than any value of failed tasks, hence 
it supersedes any failed task's value. This produces reliable accumulator 
values. This should only be used in a single stage.
 - LAST increment: allows to retrieve the latest increment for each partition 
only.

The implementation for MAX and LAST requires extra memory that scales with the 
number of partitions. The current ALL implementation does not require extra 
memory.

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code incrementing accumulators 
also produces identical accumulator increments on success. Rerunning partitions 
for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy of how a new 
increment updates over an earlier increment from the same partition, different 
semantics of accumulators (here called accumulator modes) can be implemented:
 - SUM over all increments of each partition: this represents the current 
implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only 
increment while a partition is processed, a successful task provides an 
accumulator value that is always larger than any value of failed tasks, hence 
it supersedes any failed task's value. This produces reliable accumulator 
values. This should only be used in a single stage.
 - LAST increment: allows to retrieve the latest increment for each partition 
only.

The implementation for MAX and LAST requires extra memory that scales with the 
number of partitions. The current SUM implementation does not require extra 
memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code incrementing accumulators 
> also produces identical accumulator increments on success. Rerunning 
> partitions for any reason should always produce the same increments on 
> success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy of how a new 
> increment updates over an earlier increment from the same partition, 
> different semantics of accumulators (here called accumulator modes) can be 
> implemented:
>  - ALL sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only 
> increment while a partition is processed, a successful task provides an 
> accumulator value that is always larger than any value of failed tasks, hence 
> it supersedes any failed task's value. This produces reliable accumulator 
> values. This should only be used in a single stage.
>  - LAST increment: allows to retrieve the latest increment for each partition 
> only.
> The implementation for MAX and LAST requires extra memory that scales with 
> the number of partitions. The current ALL implementation does not require 
> extra memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-m

[jira] [Created] (SPARK-30666) Reliable single-stage accumulators

2020-01-28 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-30666:
-

 Summary: Reliable single-stage accumulators
 Key: SPARK-30666
 URL: https://issues.apache.org/jira/browse/SPARK-30666
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Enrico Minack


This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code incrementing accumulators 
also produces identical accumulator increments on success. Rerunning partitions 
for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy of how a new 
increment updates over an earlier increment from the same partition, different 
semantics of accumulators (here called accumulator modes) can be implemented:
 - SUM over all increments of each partition: this represents the current 
implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only 
increment while a partition is processed, a successful task provides an 
accumulator value that is always larger than any value of failed tasks, hence 
it supersedes any failed task's value. This produces reliable accumulator 
values. This should only be used in a single stage.
 - LAST increment: allows to retrieve the latest increment for each partition 
only.

The implementation for MAX and LAST requires extra memory that scales with the 
number of partitions. The current SUM implementation does not require extra 
memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12378) CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR

2020-01-28 Thread Ori Popowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025383#comment-17025383
 ] 

Ori Popowski commented on SPARK-12378:
--

[~arun6445] there's a workaround here:

https://stackoverflow.com/a/59955511/1038182

> CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR
> ---
>
> Key: SPARK-12378
> URL: https://issues.apache.org/jira/browse/SPARK-12378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: AWS EMR 4.2.0
> Just Master Running m3.xlarge
> Applications:
> Hive 1.0.0
> Spark 1.5.2
>Reporter: CESAR MICHELETTI
>Priority: Major
>
> I am receiving the below error when trying to export data to AWS S3, in 
> spark-sql.
> Command:
> CREATE external TABLE export 
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
> -- lines terminated by '\n' 
>  STORED AS TEXTFILE
>  LOCATION 's3://xxx/yyy'
>  AS
> SELECT 
> xxx
> 
> (complete query)
> ;
> Error:
> -chgrp: '' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: '' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> 15/12/16 21:09:25 ERROR SparkSQLDriver: Failed in [CREATE external TABLE 
> csvexport
> ...
> (create table + query)
> ...
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
> at 
> or

[jira] [Created] (SPARK-30665) Remove Pandoc dependency in PySpark setup.py

2020-01-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30665:


 Summary: Remove Pandoc dependency in PySpark setup.py
 Key: SPARK-30665
 URL: https://issues.apache.org/jira/browse/SPARK-30665
 Project: Spark
  Issue Type: Improvement
  Components: Build, PySpark
Affects Versions: 2.4.4, 2.4.3
Reporter: Nicholas Chammas


PyPI now supports Markdown project descriptions, so we no longer need to 
convert the Spark README into ReStructuredText and thus no longer need pypandoc.

Removing pypandoc has the added benefit of eliminating the failure mode 
described in [this PR|https://github.com/apache/spark/pull/18981].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-28 Thread nabacg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025254#comment-17025254
 ] 

nabacg commented on SPARK-12312:


My suggestion was for people who need a working solution now and can't wait 
until there is a new Spark release out and/or can't easily upgrade their cluster 
installations (which happens in corporate multi-tenant situations; I've been 
there).

My approach avoids those problems by patching the JDBC driver instead of 
Spark. It's not a long-term solution, but perhaps it will save someone's skin. 
It certainly worked for me and one of my clients. 
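
For readers looking for the shape of such a workaround, here is a hypothetical 
sketch (my own illustration, not necessarily how the patched driver does it): 
log in from a keytab that has been shipped to the executors and open the JDBC 
connection inside {{doAs}}, so the executor-side connection is created with a 
valid Kerberos ticket.

{code:scala}
import java.security.PrivilegedExceptionAction
import java.sql.{Connection, DriverManager}

import org.apache.hadoop.security.UserGroupInformation

// Hypothetical sketch: the keytab is assumed to be shipped to executors
// (e.g. via --files user.keytab); principal, keytab name and URL are placeholders.
def kerberizedConnection(jdbcUrl: String): Connection = {
  val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "svc_spark@EXAMPLE.COM",  // placeholder principal
    "user.keytab")            // keytab file local to the executor
  ugi.doAs(new PrivilegedExceptionAction[Connection] {
    override def run(): Connection = DriverManager.getConnection(jdbcUrl)
  })
}
{code}

Whether this is enough depends on the JDBC driver (SQL Server and Oracle each 
need their own Kerberos-related connection properties), so treat it as a 
starting point rather than a drop-in fix.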

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-28 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025241#comment-17025241
 ] 

Gabor Somogyi commented on SPARK-12312:
---

[~nabacg] thanks, the approach is more or less clear, but I'm creating an 
automated docker test, which makes the PR hard. With a manual test it's already 
working.

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30664:
--
Attachment: (was: Show Additional Metrics.png)

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: image-2020-01-28-16-12-49-807.png, 
> image-2020-01-28-16-13-36-174.png, image-2020-01-28-16-15-20-258.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
>  - Peak Execution Memory
>  - Spill (Memory)
>  - Spill (Disk)
>  - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under 
> !image-2020-01-28-16-12-49-807.png! . Those new metrics on the all stages 
> page should also be made optional in the same way.
> !image-2020-01-28-16-13-36-174.png!
> Existing metrics like
>  - Input
>  - Output
>  - Shuffle Read
>  - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.
> The table extends as additional metrics get checked / unchecked:
> !image-2020-01-28-16-15-20-258.png!
> Sorting the table by metrics allows finding the stages with the highest GC 
> time or spilled bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30664:
--
Description: 
The web UI page for individual stages has many useful metrics to diagnose 
poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
stages among hundreds or thousands of stages is cumbersome, as you have to 
click through all stages on the all stages page. The all stages page should 
host more metrics from the individual stages page like
 - Peak Execution Memory
 - Spill (Memory)
 - Spill (Disk)
 - GC Time

These additional metrics would make the page more complex, so showing them 
should be optional. The individual stages page hides some metrics under 
!image-2020-01-28-16-12-49-807.png! . Those new metrics on the all stages page 
should also be made optional in the same way.

!image-2020-01-28-16-13-36-174.png!

Existing metrics like
 - Input
 - Output
 - Shuffle Read
 - Shuffle Write

could be made optional as well and active by default. Then users can remove 
them if they want but get the same view as now by default.

The table extends as additional metrics get checked / unchecked:
!image-2020-01-28-16-15-20-258.png!

Sorting the table by metrics allows finding the stages with the highest GC time 
or spilled bytes.

  was:
The web UI page for individual stages has many useful metrics to diagnose 
poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
stages among hundreds or thousands of stages is cumbersome, as you have to 
click through all stages on the all stages page. The all stages page should 
host more metrics from the individual stages page like
- Peak Execution Memory
- Spill (Memory)
- Spill (Disk)
- GC Time

These additional metrics would make the page more complex, so showing them 
should be optional. The individual stages page hides some metrics under "Show 
Additional Metrics". Those new metrics on the all stages page should also be 
made optional in the same way.

Existing metrics like
- Input
- Output
- Shuffle Read
- Shuffle Write

could be made optional as well and active by default. Then users can remove 
them if they want but get the same view as now by default.


> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: image-2020-01-28-16-12-49-807.png, 
> image-2020-01-28-16-13-36-174.png, image-2020-01-28-16-15-20-258.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
>  - Peak Execution Memory
>  - Spill (Memory)
>  - Spill (Disk)
>  - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under 
> !image-2020-01-28-16-12-49-807.png! . Those new metrics on the all stages 
> page should also be made optional in the same way.
> !image-2020-01-28-16-13-36-174.png!
> Existing metrics like
>  - Input
>  - Output
>  - Shuffle Read
>  - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.
> The table extends as additional metrics get checked / unchecked:
> !image-2020-01-28-16-15-20-258.png!
> Sorting the table by metrics allows finding the stages with the highest GC 
> time or spilled bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30664:
--
Attachment: image-2020-01-28-16-15-20-258.png

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: Show Additional Metrics.png, 
> image-2020-01-28-16-12-49-807.png, image-2020-01-28-16-13-36-174.png, 
> image-2020-01-28-16-15-20-258.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
> - Peak Execution Memory
> - Spill (Memory)
> - Spill (Disk)
> - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under "Show 
> Additional Metrics". Those new metrics on the all stages page should also be 
> made optional in the same way.
> Existing metrics like
> - Input
> - Output
> - Shuffle Read
> - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30664:
--
Attachment: image-2020-01-28-16-13-36-174.png

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: Show Additional Metrics.png, 
> image-2020-01-28-16-12-49-807.png, image-2020-01-28-16-13-36-174.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
> - Peak Execution Memory
> - Spill (Memory)
> - Spill (Disk)
> - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under "Show 
> Additional Metrics". Those new metrics on the all stages page should also be 
> made optional in the same way.
> Existing metrics like
> - Input
> - Output
> - Shuffle Read
> - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30664:
--
Attachment: image-2020-01-28-16-12-49-807.png

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: Show Additional Metrics.png, 
> image-2020-01-28-16-12-49-807.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
> - Peak Execution Memory
> - Spill (Memory)
> - Spill (Disk)
> - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under "Show 
> Additional Metrics". Those new metrics on the all stages page should also be 
> made optional in the same way.
> Existing metrics like
> - Input
> - Output
> - Shuffle Read
> - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30664:
--
Attachment: Show Additional Metrics.png

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: Show Additional Metrics.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
> - Peak Execution Memory
> - Spill (Memory)
> - Spill (Disk)
> - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under "Show 
> Additional Metrics". Those new metrics on the all stages page should also be 
> made optional in the same way.
> Existing metrics like
> - Input
> - Output
> - Shuffle Read
> - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30664) Add more metrics to the all stages page

2020-01-28 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-30664:
-

 Summary: Add more metrics to the all stages page
 Key: SPARK-30664
 URL: https://issues.apache.org/jira/browse/SPARK-30664
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Enrico Minack


The web UI page for individual stages has many useful metrics to diagnose 
poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
stages among hundreds or thousands of stages is cumbersome, as you have to 
click through all stages on the all stages page. The all stages page should 
host more metrics from the individual stages page like
- Peak Execution Memory
- Spill (Memory)
- Spill (Disk)
- GC Time

These additional metrics would make the page more complex, so showing them 
should be optional. The individual stages page hides some metrics under "Show 
Additional Metrics". Those new metrics on the all stages page should also be 
made optional in the same way.

Existing metrics like
- Input
- Output
- Shuffle Read
- Shuffle Write

could be made optional as well and active by default. Then users can remove 
them if they want but get the same view as now by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-28 Thread nabacg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025176#comment-17025176
 ] 

nabacg commented on SPARK-12312:


I originally got around this problem using the approach sketched in this repo: 

[https://github.com/nabacg/krb5sqljdb]

Hope it might help someone. It saved my project from missing an important 
deadline when we discovered this issue. Unfortunately, as you say, it's not very 
well documented, and you often tend to discover this problem in a later stage of 
the project, like when moving from DEV to UAT/PROD. 

 

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-01-28 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-30663:
---
Description: 
As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility mode 

{code}
if (grepl("^1\\..*", packageVersion("testthat"))) {
  # testthat 1.x
  test_runner <- testthat:::run_tests
  reporter <- "summary"
} else {
  # testthat >= 2.0.0
  test_runner <- testthat:::test_package_dir
  reporter <- testthat::default_reporter()
}
{code}

in {{R/pkg/tests/run-all.R}}.

It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.

  was:
As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility mode 

{code}

if (grepl("^1\\..*", packageVersion("testthat"))) {
  # testthat 1.x
  test_runner <- testthat:::run_tests
  reporter <- "summary"
} else {
  # testthat >= 2.0.0
  test_runner <- testthat:::test_package_dir
  reporter <- testthat::default_reporter()
}
{code}

It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.


> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility 
> mode 
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-01-28 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30663:
--

 Summary: Remove 1.x testthat switch once Jenkins version is 
updated to 2.x
 Key: SPARK-30663
 URL: https://issues.apache.org/jira/browse/SPARK-30663
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Tests
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz


As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility mode 

{code}

if (grepl("^1\\..*", packageVersion("testthat"))) {
  # testthat 1.x
  test_runner <- testthat:::run_tests
  reporter <- "summary"
} else {
  # testthat >= 2.0.0
  test_runner <- testthat:::test_package_dir
  reporter <- testthat::default_reporter()
}
{code}

It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-28 Thread John Lonergan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025145#comment-17025145
 ] 

John Lonergan commented on SPARK-12312:
---

Re documentation - it really would be helpful and would avoid wasted time if the 
documentation were updated to indicate that it is not possible to connect to a 
kerberized JDBC source at the moment.

This could be in the docs Foster mentions above and possibly also in 
"Troubleshooting".

Can we fix the documentation first, please, as that seems like low-hanging fruit.

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30662) ALS/MLP extend HasBlockSize

2020-01-28 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30662:


 Summary: ALS/MLP extend HasBlockSize
 Key: SPARK-30662
 URL: https://issues.apache.org/jira/browse/SPARK-30662
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30661) KMeans blockify input vectors

2020-01-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30661:


Assignee: zhengruifeng

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30661) KMeans blockify input vectors

2020-01-28 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30661:


 Summary: KMeans blockify input vectors
 Key: SPARK-30661
 URL: https://issues.apache.org/jira/browse/SPARK-30661
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30660) LinearRegression blockify input vectors

2020-01-28 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30660:


 Summary: LinearRegression blockify input vectors
 Key: SPARK-30660
 URL: https://issues.apache.org/jira/browse/SPARK-30660
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-01-28 Thread Javier Fuentes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025117#comment-17025117
 ] 

Javier Fuentes commented on SPARK-28067:


Should the expected result here be an ArithmeticException for all cases where 
precision > DecimalType.MAX_PRECISION?

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Priority: Blocker
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  ---
> sum(decNum)
> ---
> null
> ---
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30659) LogisticRegression blockify input vectors

2020-01-28 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30659:


 Summary: LogisticRegression blockify input vectors
 Key: SPARK-30659
 URL: https://issues.apache.org/jira/browse/SPARK-30659
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng
Assignee: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13

2020-01-28 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025099#comment-17025099
 ] 

Sean R. Owen commented on SPARK-29292:
--

I do - I have a large (190 files) PR that fixes all of these currently.
I am not sure whether we want to get it in for 3.0 - what do you think? It 
_might_ have perf implications for 2.12, though most .toSeq calls ought to be a 
no-op. WDYT?
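
For anyone wondering what the change looks like, a hypothetical example of the 
pattern (not taken from the PR):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// In 2.13, scala.Seq is scala.collection.immutable.Seq, so a mutable buffer can no
// longer be returned where Seq is expected; .toSeq compiles on both 2.12 and 2.13.
def executorIds(): Seq[String] = {
  val ids = ArrayBuffer("exec-1", "exec-2")  // placeholder values
  ids.toSeq  // effectively a no-op on 2.12, an immutable copy on 2.13
}
{code}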

> Fix internal usages of mutable collection for Seq in 2.13
> -
>
> Key: SPARK-29292
> URL: https://issues.apache.org/jira/browse/SPARK-29292
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a 
> simpler subset. 
> In 2.13, a mutable collection can't be returned as a 
> {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that 
> still works on 2.12.
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467:
>  type mismatch;
>  found   : Seq[String] (in scala.collection) 
>  required: Seq[String] (in scala.collection.immutable) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30642) LinearSVC blockify input vectors

2020-01-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30642.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27360
[https://github.com/apache/spark/pull/27360]

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30642) LinearSVC blockify input vectors

2020-01-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30642:


Assignee: zhengruifeng

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-28 Thread Sivakumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024981#comment-17024981
 ] 

Sivakumar commented on SPARK-30542:
---

Sure, Thanks Hyukjin

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to 
> write data to the same base directory, which is possible with DStreams.
> Because a _spark_metadata directory is created by default for one job, a 
> second job cannot use the same directory as its base path; since the 
> _spark_metadata directory was already created by the other job, it throws an 
> exception.
> Is there any workaround for this, other than creating separate base paths for 
> both jobs?
> Is it possible to create the _spark_metadata directory elsewhere or disable it 
> without any data loss?
> If I had to change the base path for both jobs, my whole framework would be 
> impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30658) Limit after on streaming dataframe before streaming agg returns wrong results

2020-01-28 Thread Tathagata Das (Jira)
Tathagata Das created SPARK-30658:
-

 Summary: Limit after on streaming dataframe before streaming agg 
returns wrong results
 Key: SPARK-30658
 URL: https://issues.apache.org/jira/browse/SPARK-30658
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.4, 2.3.3, 2.3.2, 
2.3.1, 2.3.0
Reporter: Tathagata Das


Limit before a streaming aggregate (i.e. {{df.limit(5).groupBy().count()}}) in 
complete mode was not being planned as a streaming limit. The planner rule 
planned a logical limit with a stateful streaming limit plan only if the query 
is in append mode. As a result, instead of allowing at most 5 rows across 
batches, the planned streaming query was allowing 5 rows in every batch, thus 
producing incorrect results.
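
For context, a minimal sketch of the affected query shape (illustrative only; 
the rate source and names are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").appName("limit-before-agg").getOrCreate()

// Any streaming source works; rate is just a convenient stand-in.
val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// Limit before a streaming aggregation in complete mode: the count should never
// exceed 5 across all batches, but with the bug each batch could contribute up to 5 rows.
val limited = stream.limit(5).groupBy().count()

val query = limited.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("limited_counts")
  .start()
{code}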



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

2020-01-28 Thread Tathagata Das (Jira)
Tathagata Das created SPARK-30657:
-

 Summary: Streaming limit after streaming dropDuplicates can throw 
error
 Key: SPARK-30657
 URL: https://issues.apache.org/jira/browse/SPARK-30657
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.4, 2.3.3, 2.3.2, 
2.3.1, 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


{{LocalLimitExec}} does not consume the iterator of the child plan. So if there 
is a limit after a stateful operator like streaming dedup in append mode (e.g. 
{{streamingdf.dropDuplicates().limit(5)}}), the state changes of the streaming 
dedup may not be committed (most stateful ops commit state changes only 
after the generated iterator is fully consumed). This leads to the next batch 
failing with {{java.lang.IllegalStateException: Error reading delta file 
.../N.delta does not exist}} as the state store delta file was never generated.
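
A minimal sketch of a query shape that can hit this (illustrative only; the 
rate source and names are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").appName("dedup-then-limit").getOrCreate()

// Streaming dedup followed by a limit in append mode: if the limit stops consuming
// the child iterator early, the dedup state may never be committed, and the next
// batch can fail to find the expected state-store delta file.
val deduped = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .dropDuplicates("value")
  .limit(5)

val query = deduped.writeStream
  .outputMode("append")
  .format("memory")
  .queryName("deduped_limited")
  .start()
{code}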



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently

2020-01-28 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024973#comment-17024973
 ] 

Gabor Somogyi commented on SPARK-23829:
---

Cannot fetch record means Spark initiated the fetch but it timed out. First I 
would take a look at what happened on the Kafka side (since Kafka didn't 
respond in time). If you still think it's a Spark issue, I would like to 
reproduce it with vanilla Spark - please create a new jira with logs.
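
If the Kafka side is just slow rather than broken, one knob worth checking (an 
assumption on my part; {{kafkaConsumer.pollTimeoutMs}} is the source option 
documented for the Kafka integration) is raising the executor-side poll 
timeout, e.g.:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-read").getOrCreate()

// Illustrative only: broker list and topic are placeholders; the option name is
// taken from the structured-streaming Kafka integration docs.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my_topic")
  .option("kafkaConsumer.pollTimeoutMs", "120000")
  .load()
{code}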

> spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
> 
>
> Key: SPARK-23829
> URL: https://issues.apache.org/jira/browse/SPARK-23829
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Norman Bai
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In spark 2.3 , it provides a source "spark-sql-kafka-0-10_2.11".
>  
> When I wanted to read from my kafka-0.10.2.1 cluster, it frequently threw the 
> error "*java.util.concurrent.TimeoutException: Cannot fetch record  for offset 
> in 12000 milliseconds*", and the job thus failed.
>  
> I searched on google & stackoverflow for a while, and found many other people 
> who got this exception too, and nobody gave an answer.
>  
> I debugged the source code and found nothing, but I guess it's because of the 
> kafka-clients dependency that spark-sql-kafka-0-10_2.11 is using.
>  
> {code:java}
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
>   <version>2.3.0</version>
>   <exclusions>
>     <exclusion>
>       <artifactId>kafka-clients</artifactId>
>       <groupId>org.apache.kafka</groupId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.2.1</version>
> </dependency>
> {code}
> I excluded it from maven, added another version, reran the code, and now it 
> works.
>  
> I guess something is wrong with kafka-clients 0.10.0.1 working with kafka 
> 0.10.2.1, or with more kafka versions. 
>  
> Hope for an explanation.
> Here is the error stack.
> {code:java}
> [ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = 
> 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6]] 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query 
> [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 
> (TID 6, localhost, executor driver): java.util.concurrent.TimeoutException: 
> Cannot fetch record for offset 6481521 in 12 milliseconds
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> o