[jira] [Commented] (SPARK-31336) Support Oracle Kerberos login in JDBC connector

2020-04-14 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082931#comment-17082931
 ] 

Gabor Somogyi commented on SPARK-31336:
---

Started to work on this...

> Support Oracle Kerberos login in JDBC connector
> ---
>
> Key: SPARK-31336
> URL: https://issues.apache.org/jira/browse/SPARK-31336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31425) UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should also respect UnsafeAlignedOffset

2020-04-14 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31425:
-
Summary: UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should 
also respect UnsafeAlignedOffset  (was: UnsafeKVExternalSorter should also 
respect UnsafeAlignedOffset)

> UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should also 
> respect UnsafeAlignedOffset
> --
>
> Key: SPARK-31425
> URL: https://issues.apache.org/jira/browse/SPARK-31425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: wuyi
>Priority: Major
>
> Since BytesToBytesMap respects UnsafeAlignedOffset when writing the record, 
> UnsafeKVExternalSorter should also respect UnsafeAlignedOffset when reading 
> the record; otherwise it will cause a data correctness issue.
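
For reference, a minimal sketch of the size-prefix read/write pattern the ticket describes, using the org.apache.spark.unsafe.UnsafeAlignedOffset helper; this is illustrative only, not the actual BytesToBytesMap or UnsafeKVExternalSorter code.
{code:scala}
import org.apache.spark.unsafe.UnsafeAlignedOffset

// Writing side (what BytesToBytesMap does): the size prefix occupies
// UnsafeAlignedOffset.getUaoSize() bytes, i.e. 4 bytes on platforms that allow
// unaligned access and 8 bytes on word-aligned platforms.
def writeLength(base: AnyRef, offset: Long, len: Int): Unit =
  UnsafeAlignedOffset.putSize(base, offset, len)

// Reading side: must go through the same helper instead of a hard-coded 4-byte
// read, otherwise word-aligned platforms read a wrong length and corrupt data.
def readLength(base: AnyRef, offset: Long): Int =
  UnsafeAlignedOffset.getSize(base, offset)
{code}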



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-04-14 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082985#comment-17082985
 ] 

Hongze Zhang commented on SPARK-31437:
--

Hi [~tgraves], is there already any plan on this or similar topics? I roughly 
searched JIRA and Git history and didn't find enough information.

> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g., if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can run on the existing executor, but 
> Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.
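
For context, a minimal sketch of the stage-level scheduling API introduced by SPARK-29154 (assuming Spark 3.1 with dynamic allocation on a supported cluster manager); with the strict matching described above, tasks of the mapped stage run only on executors created for this exact profile id.
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

def runWithCustomProfile(sc: SparkContext): Unit = {
  // Executor with 8 cores, tasks requiring 4 cpus each: 2 tasks fit per executor.
  val execReqs = new ExecutorResourceRequests().cores(8).memory("8g")
  val taskReqs = new TaskResourceRequests().cpus(4)
  val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

  val result = sc.parallelize(1 to 100, 8)
    .map(_ * 2)
    .withResources(profile)  // this stage gets its own ResourceProfile id
    .collect()
  println(result.length)
}
{code}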



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-04-14 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082985#comment-17082985
 ] 

Hongze Zhang edited comment on SPARK-31437 at 4/14/20, 8:28 AM:


Hi [~tgraves], is there already any plan on this or similar topics? I roughly 
searched JIRA and Git history and didn't find enough information.


was (Author: zhztheplayer):
Hi [~tgraves], is there already any plan on this or similar topics? I roughly 
searched JIRA and Git history and didn't found enough information.

> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g., if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can run on the existing executor, but 
> Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31442) Print shuffle id at coalesce partitions target size

2020-04-14 Thread ulysses you (Jira)
ulysses you created SPARK-31442:
---

 Summary: Print shuffle id at coalesce partitions target size
 Key: SPARK-31442
 URL: https://issues.apache.org/jira/browse/SPARK-31442
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083024#comment-17083024
 ] 

Takeshi Yamamuro commented on SPARK-31429:
--

[~hyukjin.kwon] Have you already started to work on this? If no one is working 
on it yet, I have time in the latter half of this week, so I can take it.

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> categories in the function documentation. For example, we want to group window 
> functions into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27296) Efficient User Defined Aggregators

2020-04-14 Thread Patrick Cording (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083098#comment-17083098
 ] 

Patrick Cording commented on SPARK-27296:
-

I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping the Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour?

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
> Fix For: 3.0.0
>
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.
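
For reference, a minimal sketch of the Aggregator-based path this ticket delivered in Spark 3.0 via org.apache.spark.sql.functions.udaf; the LongAvg aggregator below is purely illustrative, not code from the ticket. Its reduce/merge mutate the buffer object directly, so ser/de only happens at boundaries such as shuffles and the final result.
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

case class AvgBuffer(var sum: Long, var count: Long)

object LongAvg extends Aggregator[Long, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0L, 0L)
  // Operates on the deserialized buffer in place; no per-row buffer ser/de.
  def reduce(b: AvgBuffer, x: Long): AvgBuffer = { b.sum += x; b.count += 1; b }
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum.toDouble / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object UdafSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udaf-sketch").getOrCreate()
    // Register the typed Aggregator as an untyped aggregate function.
    val longAvg = udaf(LongAvg, Encoders.scalaLong)
    spark.range(0, 1000000L).toDF("value")
      .select(longAvg(col("value")).as("avg"))
      .show()
    spark.stop()
  }
}
{code}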



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27296) Efficient User Defined Aggregators

2020-04-14 Thread Patrick Cording (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083098#comment-17083098
 ] 

Patrick Cording edited comment on SPARK-27296 at 4/14/20, 10:42 AM:


I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))
res.show()

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping the Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour?

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]


was (Author: cording):
I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping the Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour?

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
> Fix For: 3.0.0
>
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.

[jira] [Comment Edited] (SPARK-27296) Efficient User Defined Aggregators

2020-04-14 Thread Patrick Cording (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083098#comment-17083098
 ] 

Patrick Cording edited comment on SPARK-27296 at 4/14/20, 10:43 AM:


I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))
res.show()

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping the Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour? In my example, 
it should just be one row that should be deserialized.

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]


was (Author: cording):
I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))
res.show()

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping the Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour?

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
> Fix For: 3.0.0
>
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.

[jira] [Comment Edited] (SPARK-27296) Efficient User Defined Aggregators

2020-04-14 Thread Patrick Cording (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083098#comment-17083098
 ] 

Patrick Cording edited comment on SPARK-27296 at 4/14/20, 10:48 AM:


I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))
res.show()

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping an Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour? In my example, 
it should just be one row that should be deserialized.

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]


was (Author: cording):
I've been trying this out, and I have a couple of questions.

First, here's a snippet showing what I did.
{code:java}
class LongestRunUdaf extends UserDefinedAggregateFunction {
  // ...
}

class LongestRunAggregator extends Aggregator[String, LongestRunBuffer, 
Option[Run]] {
  // ...
}

// This is to get a reference. It is slow.
val lrUdaf = new LongestRunUdaf
val result = df.select(lrUdaf(col("value")))

// This is many times faster, but `res` is now a `DataFrame`
val longestRunAggregator = udaf(new LongestRunAggregator, Encoders.STRING)
val res = dataset.select(longestRunAggregator(col("value")))
res.show()

// This creates a `Dataset[Option[Run]]` as needed, but now it is as slow as 
the old UDAF
res.as(Encoders.kryo[Option[Run]])

// Furthermore, this is still as slow as before
val longestRunAggregator = (new LongestRunAggregator).toColumn
dataset.select(longestRunAggregator).first{code}
I am confused by the fact that the performance improvement only is visible when 
using UDAFs, but in order to get there, I need to define an Aggregator, which 
is still slow if I use it directly. Is it not possible to also achieve the 
improvement for the type-safe cases where Aggregators are used directly?

Also, wrapping the Aggregator where the output is of a non-standard type leaves 
you with a DataFrame with binary data. Converting it to a Dataset seems to 
negate the performance improvement. Is this intended behaviour? In my example, 
it should just be one row that should be deserialized.

Or am I just using this in the wrong way?

My full experiment is here: [https://github.com/patrickcording/udaf-benchmark]

> Efficient User Defined Aggregators 
> ---
>
> Key: SPARK-27296
> URL: https://issues.apache.org/jira/browse/SPARK-27296
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
>  Labels: performance, usability
> Fix For: 3.0.0
>
>
> Spark's UDAFs appear to be serializing and de-serializing to/from the 
> MutableAggregationBuffer for each row.  This gist shows a small reproducing 
> UDAF and a spark shell session:
> [https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f]
> The UDAF and its companion UDT are designed to count the number of times 
> that ser/de is invoked for the aggregator.  The spark shell session 
> demonstrates that it is executing ser/de on every row of the data frame.
> Note, Spark's pre-defined aggregators do not have this problem, as they are 
> based on an internal aggregating trait that does the correct thing and only 
> calls ser/de at points such as partition boundaries, presenting final 
> results, etc.
> This is a major problem for UDAFs, as it means that every UDAF is doing a 
> massive amount of unnecessary work per row, including but not limited to Row 
> object allocations. For a more realistic UDAF having its own non trivial 
> internal structure it is obviously that much worse.

[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083154#comment-17083154
 ] 

Hyukjin Kwon commented on SPARK-31429:
--

I haven't started it yet. Please go ahead, I would appreciate it!

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> categories in the function documentation. For example, we want to group window 
> functions into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083156#comment-17083156
 ] 

Takeshi Yamamuro commented on SPARK-31429:
--

Thanks!

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> categories in the function documentation. For example, we want to group window 
> functions into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details:
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31443) Perf regression of toJavaDate

2020-04-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31443:
--

 Summary: Perf regression of toJavaDate
 Key: SPARK-31443
 URL: https://issues.apache.org/jira/browse/SPARK-31443
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


DateTimeBenchmark shows the regression

Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27
{code}
Conversion from/to external types
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
From java.sql.Date                     614           655         43        8.1        122.8      1.0X
{code}

Current master:
{code}
Conversion from/to external types
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
From java.sql.Date                    1154          1206         46        4.3        230.9      1.0X
{code}

The regression is ~x2.
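
For orientation, a rough spark-shell sketch of what this conversion benchmark exercises (not the actual DateTimeBenchmark code): building a Dataset from java.sql.Date values goes through fromJavaDate, and collecting them back goes through toJavaDate once per row, which is the path this ticket tracks.
{code:scala}
// Run in spark-shell (the `spark` session and its implicits are assumed available).
import java.sql.Date
import spark.implicits._

val dates = spark.range(0, 5000000L).map(_ => Date.valueOf("2020-04-14"))
// Encoding the java.sql.Date values uses fromJavaDate; collecting the 5M rows
// back to the driver converts each internal date with toJavaDate.
dates.collect()
{code}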



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31443) Perf regression of toJavaDate

2020-04-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083217#comment-17083217
 ] 

Maxim Gekk commented on SPARK-31443:


FYI [~cloud_fan]

> Perf regression of toJavaDate
> -
>
> Key: SPARK-31443
> URL: https://issues.apache.org/jira/browse/SPARK-31443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27]
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                   559           603         38        8.9        111.8      1.0X
> Collect dates                       2306          3221       1558        2.2        461.1      0.2X
> {code}
> Current master:
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                  1052          1130         73        4.8        210.3      1.0X
> Collect dates                       3251          4943       1624        1.5        650.2      0.3X
> {code}
> If we subtract preparing DATE column:
> * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row
> * master is (650.2 - 210.3) = 439 ns/row
> The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 
> 349.3)/349.3 = 25%



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31443) Perf regression of toJavaDate

2020-04-14 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31443:
---
Description: 
DateTimeBenchmark shows the regression

Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27]
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
From java.sql.Date                   559           603         38        8.9        111.8      1.0X
Collect dates                       2306          3221       1558        2.2        461.1      0.2X
{code}
Current master:
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
From java.sql.Date                  1052          1130         73        4.8        210.3      1.0X
Collect dates                       3251          4943       1624        1.5        650.2      0.3X
{code}
If we subtract preparing DATE column:
* Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row
* master is (650.2 - 210.3) = 439 ns/row

The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 
349.3)/349.3 = 25%

  was:
DateTimeBenchmark shows the regression

Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27
{code}
Conversion from/to external types
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
From java.sql.Date                     614           655         43        8.1        122.8      1.0X
{code}

Current master:
{code}
Conversion from/to external types
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
From java.sql.Date                    1154          1206         46        4.3        230.9      1.0X
{code}

The regression is ~x2.


> Perf regression of toJavaDate
> -
>
> Key: SPARK-31443
> URL: https://issues.apache.org/jira/browse/SPARK-31443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27]
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                   559           603         38        8.9        111.8      1.0X
> Collect dates                       2306          3221       1558        2.2        461.1      0.2X
> {code}
> Current master:
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> --

[jira] [Comment Edited] (SPARK-31443) Perf regression of toJavaDate

2020-04-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083217#comment-17083217
 ] 

Maxim Gekk edited comment on SPARK-31443 at 4/14/20, 1:21 PM:
--

FYI [~cloud_fan] I got the numbers on the master without 
https://github.com/apache/spark/pull/28205


was (Author: maxgekk):
FYI [~cloud_fan]

> Perf regression of toJavaDate
> -
>
> Key: SPARK-31443
> URL: https://issues.apache.org/jira/browse/SPARK-31443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27]
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                   559           603         38        8.9        111.8      1.0X
> Collect dates                       2306          3221       1558        2.2        461.1      0.2X
> {code}
> Current master:
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                  1052          1130         73        4.8        210.3      1.0X
> Collect dates                       3251          4943       1624        1.5        650.2      0.3X
> {code}
> If we subtract preparing DATE column:
> * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row
> * master is (650.2 - 210.3) = 439 ns/row
> The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 
> 349.3)/349.3 = 25%



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-04-14 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-31437:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g., if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can run on the existing executor, but 
> Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-04-14 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083250#comment-17083250
 ] 

Thomas Graves commented on SPARK-31437:
---

This is something I wanted to eventually do but I have not started working on 
this as I wanted to keep it basic for the first implementation to make it 
easier to review and think about. I agree this feature would be nice, but it 
adds a fair bit of complexity.   I think there are other parts of the feature 
that aren't yet implemented that are more important.  I'm actually surprised 
someone is using this already, so glad to see that.

Feel free to work on this if you would like, but I want to make sure that whatever 
we do is optional and isn't too invasive. Once you start doing this, you definitely 
have the potential to waste a lot of resources when you place things badly. We 
discussed some of this in one of the docs or PRs, but I can't remember which one. 
For instance, you can place a CPU-based task onto an executor with GPUs and waste 
that expensive GPU resource the entire time. Note that right now everything is 
tracked by the resource profile id, so if you start putting tasks on executors with 
different resource profiles, that tracking needs to be enhanced to handle it. 
Similarly, the allocation manager has to handle it differently as well, because 
right now it is very straightforward, which was intentional to make the initial 
implementation and review easier.

> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g., if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can run on the existing executor, but 
> Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31444) Pyspark memory and cores calculation doesn't account for task cpus

2020-04-14 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-31444:
-

 Summary: Pyspark memory and cores calculation doesn't account for 
task cpus
 Key: SPARK-31444
 URL: https://issues.apache.org/jira/browse/SPARK-31444
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5
Reporter: Thomas Graves


During the changes for stage-level scheduling, I discovered a possible issue: the 
calculation for splitting up the pyspark memory doesn't take the spark.task.cpus 
setting into account.

Discussion here: 
[https://github.com/apache/spark/pull/28085#discussion_r407573038]

See PythonRunner.scala:

[https://github.com/apache/spark/blob/6b88d136deb99afd9363b208fd6fe5684fe8c3b8/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L90]
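
An illustrative sketch of the concern, with simplified names (this is not the actual PythonRunner code): the per-worker pyspark memory is currently derived from the executor core count alone, while with spark.task.cpus > 1 only executorCores / taskCpus workers run concurrently.
{code:scala}
import org.apache.spark.SparkConf

def pysparkWorkerMemoryMb(conf: SparkConf): (Long, Long) = {
  val executorCores = conf.getInt("spark.executor.cores", 1)
  val taskCpus      = conf.getInt("spark.task.cpus", 1)
  val pysparkMemMb  = conf.getSizeAsMb("spark.executor.pyspark.memory", "0")

  val currentSplit  = pysparkMemMb / executorCores                          // ignores spark.task.cpus
  val proposedSplit = pysparkMemMb / math.max(1, executorCores / taskCpus)  // per concurrently running task
  (currentSplit, proposedSplit)
}
{code}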



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30466) remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13

2020-04-14 Thread Nicholas Marion (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083270#comment-17083270
 ] 

Nicholas Marion commented on SPARK-30466:
-

Also there were two more CVEs opened late last year for these same dependencies:

[https://nvd.nist.gov/vuln/detail/CVE-2019-10172] (CVSS 3.0 Score 5.9 Medium)

[https://nvd.nist.gov/vuln/detail/CVE-2019-10202] (CVSS 3.0 Score 8.1 High)

> remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13
> --
>
> Key: SPARK-30466
> URL: https://issues.apache.org/jira/browse/SPARK-30466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Michael Burgener
>Priority: Major
>  Labels: security
>
> These 2 libraries are deprecated and replaced by the jackson-databind 
> libraries which are already included.  These two libraries are flagged by our 
> vulnerability scanners as having the following security vulnerabilities.  
> I've set the priority to Major due to the Critical nature and hopefully they 
> can be addressed quickly.  Please note, I'm not a developer but work in 
> InfoSec and this was flagged when we incorporated spark into our product.  If 
> you feel the priority is not set correctly please change accordingly.  I'll 
> watch the issue and flag our dev team to update once resolved.  
> jackson-mapper-asl-1.9.13
> CVE-2018-7489 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-7489] 
>  
> CVE-2017-7525 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-7525]
>  
> CVE-2017-17485 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-17485]
>  
> CVE-2017-15095 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-15095]
>  
> CVE-2018-5968 (CVSS 3.0 Score 8.1 High)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-5968]
>  
> jackson-core-asl-1.9.13
> CVE-2016-7051 (CVSS 3.0 Score 8.6 High)
> https://nvd.nist.gov/vuln/detail/CVE-2016-7051



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31439) Perf regression of fromJavaDate

2020-04-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31439.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28205
[https://github.com/apache/spark/pull/28205]

> Perf regression of fromJavaDate
> ---
>
> Key: SPARK-31439
> URL: https://issues.apache.org/jira/browse/SPARK-31439
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27
> {code}
> 
> Conversion from/to external types
> 
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                     614           655         43        8.1        122.8      1.0X
> {code}
> Current master:
> {code}
> Conversion from/to external types
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                    1154          1206         46        4.3        230.9      1.0X
> {code}
> The regression is ~x2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31439) Perf regression of fromJavaDate

2020-04-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31439:
---

Assignee: Maxim Gekk

> Perf regression of fromJavaDate
> ---
>
> Key: SPARK-31439
> URL: https://issues.apache.org/jira/browse/SPARK-31439
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27
> {code}
> 
> Conversion from/to external types
> 
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                     614           655         43        8.1        122.8      1.0X
> {code}
> Current master:
> {code}
> Conversion from/to external types
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from java.sql.Timestamp:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> From java.sql.Date                    1154          1206         46        4.3        230.9      1.0X
> {code}
> The regression is ~x2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083314#comment-17083314
 ] 

Maxim Gekk commented on SPARK-31423:


I am working on the issue.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31236) Spark error while consuming data from Kinesis direct end point

2020-04-14 Thread Thukarama Prabhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083327#comment-17083327
 ] 

Thukarama Prabhu commented on SPARK-31236:
--

I raised this issue with AWS support and got the response below. It looks like the 
Spark code requires a fix.

The Kinesis spark code is currently using "AmazonKinesisClient" class to create 
the credentials.

- 
[https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala#L142]

 Unfortunately, the service team confirmed that "AmazonKinesisClient" class is 
deprecated and we do not support it anymore. 

- 
[https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/kinesis/AmazonKinesisClient.html]

 The Kinesis spark code is constructing the Kinesis client from the above 
deprecated class where the region can not be passed and it parses region from 
the endpoint. 

 

We currently use 'AmazonKinesisClientBuilder' class instead of 
'AmazonKinesisClient' in AWS Kinesis SDK where there is an option to set the 
region:

- 
[https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/kinesis/AmazonKinesisClientBuilder.html]

 

Since Spark Streaming and Kinesis integration code is not owned by AWS, the 
service team recommended that you reach out to the library maintainer to fix 
the issue.

- 
[https://github.com/apache/spark/tree/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis]
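
For illustration, a minimal sketch of the builder-based construction the AWS answer points to (this is not a patch to the kinesis-asl module); the endpoint and region values are the ones from this report.
{code:scala}
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.kinesis.{AmazonKinesis, AmazonKinesisClientBuilder}

val endpointUrl = "https://kinesis-ae1.hdw.r53.deap.tv"  // direct endpoint
val regionName  = "us-east-1"                            // region set explicitly, not parsed from the URL

val client: AmazonKinesis = AmazonKinesisClientBuilder.standard()
  .withEndpointConfiguration(new EndpointConfiguration(endpointUrl, regionName))
  .build()
{code}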

 

> Spark error while consuming data from Kinesis direct end point
> --
>
> Key: SPARK-31236
> URL: https://issues.apache.org/jira/browse/SPARK-31236
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Java API
>Affects Versions: 2.4.5
>Reporter: Thukarama Prabhu
>Priority: Critical
>
> Here is the summary of the issue I am experiencing when using kinesis direct 
> URL for consuming data using spark.
> *Kinesis direct URL:* 
> [https://kinesis-ae1.hdw.r53.deap.tv|https://kinesis-ae1.hdw.r53.deap.tv/] 
> (Failing with Credential should be scoped to a valid region, not 'ae1')
> *Kinesis default URL:* 
> [https://kinesis.us-east-1.amazonaws.com|https://kinesis.us-east-1.amazonaws.com/]
>  (Working)
> Spark code for consuming data
> SparkAWSCredentials credentials = 
> commonService.getSparkAWSCredentials(kinApp.propConfig);
> KinesisInputDStream kinesisStream = KinesisInputDStream.builder()
>     .streamingContext(jssc)
>     .checkpointAppName(applicationName)
>     .streamName(streamName)
>     .endpointUrl(endpointURL)
>     .regionName(regionName)
>     
> .initialPosition(KinesisInitialPositions.fromKinesisInitialPosition(initPosition))
>     .checkpointInterval(checkpointInterval)
>     .kinesisCredentials(credentials)
>     .storageLevel(StorageLevel.MEMORY_AND_DISK_2()).build();
>  
> Spark version 2.4.4
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
>     <version>2.4.5</version>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>amazon-kinesis-client</artifactId>
>     <version>1.13.3</version>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>aws-java-sdk</artifactId>
>     <version>1.11.747</version>
> </dependency>
>  
> The spark application works fine when I use default URL but fails when I 
> change to direct URL with below error. The direct URL works when I try to 
> publish to direct kinesis URL. Issue only when I try to consume data.
>  
> 2020-03-24 08:43:40,650 ERROR - Caught

[jira] [Created] (SPARK-31445) Avoid floating-point division in millisToDays

2020-04-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31445:
--

 Summary: Avoid floating-point division in millisToDays
 Key: SPARK-31445
 URL: https://issues.apache.org/jira/browse/SPARK-31445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
Reporter: Maxim Gekk


As the benchmark https://github.com/MaxGekk/spark/pull/27 and the comparison to 
Spark 3.0 + an optimisation of fromJavaDate in 
https://github.com/apache/spark/pull/28205 show, floating-point ops in 
millisToDays badly impact the performance of converting java.sql.Date to 
Catalyst's date values. This ticket aims to replace the double ops with int/long ops.
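
A rough sketch of the idea (illustrative only, not the actual DateTimeUtils code):

{code:scala}
val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

// Current approach: floating-point division followed by floor.
def millisToDaysViaDouble(millisLocal: Long): Int =
  Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt

// Proposed approach: integer floor division only.
def millisToDaysViaFloorDiv(millisLocal: Long): Int =
  Math.floorDiv(millisLocal, MILLIS_PER_DAY).toInt

// Both agree, including for instants before the epoch:
assert(millisToDaysViaDouble(-1L) == millisToDaysViaFloorDiv(-1L)) // both return -1
{code}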



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31446) Make html elements for a paged table possible to have different id attribute.

2020-04-14 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-31446:
--

 Summary: Make html elements for a paged table possible to have 
different id attribute.
 Key: SPARK-31446
 URL: https://issues.apache.org/jira/browse/SPARK-31446
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Some pages have a paged table with page navigations above and below the table, 
but the corresponding HTML elements in the two page navigations share the same id 
attribute. Every id should be unique.
For example, there are two `form-completedJob-table-page` ids in JobsPage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31445) Avoid floating-point division in millisToDays

2020-04-14 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-31445.

Resolution: Won't Fix

> Avoid floating-point division in millisToDays
> -
>
> Key: SPARK-31445
> URL: https://issues.apache.org/jira/browse/SPARK-31445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Priority: Minor
>
> As the benchmark https://github.com/MaxGekk/spark/pull/27 and the comparison to 
> Spark 3.0 + an optimisation of fromJavaDate in 
> https://github.com/apache/spark/pull/28205 show, floating-point ops in 
> millisToDays badly impact the performance of converting java.sql.Date to 
> Catalyst's date values. This ticket aims to replace the double ops with int/long ops.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31403) TreeNode asCode function incorrectly handles null literals

2020-04-14 Thread Carl Sverre (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083576#comment-17083576
 ] 

Carl Sverre commented on SPARK-31403:
-

Thanks for checking this on master [~hyukjin.kwon]!  If you happen to have 
Spark master running, can you send me `plan.asCode` and/or the output of your 
repro?  I am very curious to understand what Spark is generating in this case.  
It may be possible to repro this in master with something as simple as `select 
null from t`.

If you don't have time I will see if I can get spark master running and try to 
repro this myself.  Thanks!

> TreeNode asCode function incorrectly handles null literals
> --
>
> Key: SPARK-31403
> URL: https://issues.apache.org/jira/browse/SPARK-31403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Carl Sverre
>Priority: Minor
>
> In the TreeNode code in Catalyst the asCode function incorrectly handles null 
> literals.  When it tries to render a null literal it will match {{null}} 
> using the third case expression and try to call {{null.toString}} which will 
> raise a NullPointerException.
> I verified this bug exists in Spark 2.4.4 and the same code appears to be in 
> master:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L707]
> The fix seems trivial - add an explicit case for null.
> One way to reproduce this is via:
> {code:java}
> val plan = spark
>   .sql("select if(isnull(id), null, 2) from testdb_jdbc.users")
>   .queryExecution
>   .optimizedPlan
> println(plan.asInstanceOf[Project].projectList.head.asCode)
> {code}
> However any other way which generates a Literal with the value null will 
> cause the issue.
> In this case the above SparkSQL will generate the literal: {{Literal(null, 
> IntegerType)}} for the "trueValue" of the if statement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083595#comment-17083595
 ] 

Maxim Gekk commented on SPARK-31423:


I have debugged this a bit on Spark 2.4. While parsing from UTF8String, 
'1582-10-14' falls into this case:
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2762-L2768
{code:java}
// The date is in a "missing" period.
if (!isLenient()) {
    throw new IllegalArgumentException("the specified date doesn't exist");
}
// Take the Julian date for compatibility, which
// will produce a Gregorian date.
fixedDate = jfd;
{code}
In the strict (non-lenient) mode, the code at 
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L517
would throw the exception:
{code}
throw new IllegalArgumentException("the specified date doesn't exist")
{code}
but we are in the lenient mode, so Java 7's GregorianCalendar interprets the date 
specially:
{code}
// Take the Julian date for compatibility, which
// will produce a Gregorian date.
{code}

The date '1582-10-14' doesn't exist in the hybrid calendar used by the Java 7 time 
API. It is questionable how to handle such a date in that calendar. 
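
To illustrate the difference between the calendars (a small JDK 8 / Scala sketch, 
not Spark code; the printed values are what I would expect from the lenient hybrid 
calendar vs the proleptic Gregorian one):

{code:scala}
// Hybrid Julian-Gregorian calendar (java.sql.Date backed by a lenient
// GregorianCalendar): the "missing" date is reinterpreted and shifts by 10 days.
println(java.sql.Date.valueOf("1582-10-14"))      // expected: 1582-10-24

// Proleptic Gregorian calendar (java.time): the date is accepted as-is.
println(java.time.LocalDate.parse("1582-10-14"))  // 1582-10-14
{code}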

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should 

[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2020-04-14 Thread Alexandre Archambault (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083606#comment-17083606
 ] 

Alexandre Archambault commented on SPARK-2620:
--

FYI, adding "final" in front of "case class P(name: String)" fixes that.

{{final case class P(name: String)}}

Beware that this works thanks to a bug, described around 
[https://github.com/scala/bug/issues/4440#issuecomment-365858185] in 
particular, that makes final case classes ignore their outer reference in 
equals. It's still not "fixed" as of scala 2.12.10 / 2.13.1, so adding final 
still works as a workaround.

[https://github.com/scala/bug/issues/11940] discusses maybe allowing to tweak 
outer references comparisons, which could offer a way to properly fix the issue 
here. (One could add an equals method to the outer wrapper, if 
[https://github.com/scala/bug/issues/11940] gets fixed, fixing the issue here.)
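
For reference, a minimal spark-shell sketch of the workaround (illustrative; the 
expected result assumes the outer reference is ignored in equals, as described above):

{code:scala}
// Declaring the case class as final makes equals ignore the REPL's outer reference,
// so instances with equal fields compare equal across serialization boundaries.
final case class P(name: String)

val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(p => (p, 1)).reduceByKey(_ + _).collect()
// expected (order may vary): Array((P(alice),1), (P(charly),1), (P(bob),2))
{code}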

> case class cannot be used as key for reduce
> ---
>
> Key: SPARK-2620
> URL: https://issues.apache.org/jira/browse/SPARK-2620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 
> 2.2.0, 2.3.0
> Environment: reproduced on spark-shell local[4]
>Reporter: Gerard Maas
>Assignee: Tobias Schlatter
>Priority: Critical
>  Labels: bulk-closed, case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(abe),1), (P(charly),1))
> In contrast to the expected behavior, that should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9621) Closure inside RDD doesn't properly close over environment

2020-04-14 Thread Alexandre Archambault (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083608#comment-17083608
 ] 

Alexandre Archambault commented on SPARK-9621:
--

FYI, like in https://issues.apache.org/jira/browse/SPARK-2620, adding "final" 
when defining the case class seems to fix that.

{{final case class MyTest(i: Int)}}

Beware that this workaround actually relies on a bug in scala, unfixed as of 
scala 2.12.10 / 2.13.1. See the discussion 
[here|https://issues.apache.org/jira/browse/SPARK-2620?focusedCommentId=17083606&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17083606].
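
A quick spark-shell sketch of the workaround, mirroring the repro from the 
description (the expected output assumes the workaround behaves as described in 
SPARK-2620):

{code:scala}
final case class MyTest(i: Int)

val tv = MyTest(1)
val res = sc.parallelize(Array((t: MyTest) => t == tv)).first()(tv)
println(res)  // expected: true once the outer reference is ignored in equals
{code}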

> Closure inside RDD doesn't properly close over environment
> --
>
> Key: SPARK-9621
> URL: https://issues.apache.org/jira/browse/SPARK-9621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Ubuntu 15.04, spark-1.4.1-bin-hadoop2.6 package
>Reporter: Joe Near
>Priority: Major
>
> I expect the following:
> case class MyTest(i: Int)
> val tv = MyTest(1)
> val res = sc.parallelize(Array((t: MyTest) => t == tv)).first()(tv)
> to be "true." It is "false," when I type this into spark-shell. It seems the 
> closure is changed somehow when it's serialized and deserialized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083611#comment-17083611
 ] 

Bruce Robbins commented on SPARK-31423:
---

{quote}It is questionable how to handle the date in such calendar.{quote}
Sorry if I am asking a dumb question. Are you saying that ORC stores only in 
the hybrid Julian calendar?

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2620) case class cannot be used as key for reduce

2020-04-14 Thread Alexandre Archambault (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083606#comment-17083606
 ] 

Alexandre Archambault edited comment on SPARK-2620 at 4/14/20, 10:02 PM:
-

FYI, adding "final" in front of "case class P(name: String)" fixes that.

{{final case class P(name: String)}}

Beware that this works thanks to a bug, described around 
[https://github.com/scala/bug/issues/4440#issuecomment-365858185], that makes 
final case classes ignore their outer reference in equals. It's still not 
"fixed" as of scala 2.12.10 / 2.13.1, so adding final still works as a 
workaround.

[https://github.com/scala/bug/issues/11940] discusses maybe allowing to tweak 
outer references comparisons, which could offer a way to properly fix the issue 
here. (One could add an equals method to the outer wrapper, if 
[https://github.com/scala/bug/issues/11940] gets fixed, fixing the issue here.)


was (Author: alexarchambault):
FYI, adding "final" in front of "case class P(name: String)" fixes that.

{{final case class P(name: String)}}

Beware that this works thanks to a bug, described around 
[https://github.com/scala/bug/issues/4440#issuecomment-365858185] in 
particular, that makes final case classes ignore their outer reference in 
equals. It's still not "fixed" as of scala 2.12.10 / 2.13.1, so adding final 
still works as a workaround.

[https://github.com/scala/bug/issues/11940] discusses maybe allowing to tweak 
outer references comparisons, which could offer a way to properly fix the issue 
here. (One could add an equals method to the outer wrapper, if 
[https://github.com/scala/bug/issues/11940] gets fixed, fixing the issue here.)

> case class cannot be used as key for reduce
> ---
>
> Key: SPARK-2620
> URL: https://issues.apache.org/jira/browse/SPARK-2620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 
> 2.2.0, 2.3.0
> Environment: reproduced on spark-shell local[4]
>Reporter: Gerard Maas
>Assignee: Tobias Schlatter
>Priority: Critical
>  Labels: bulk-closed, case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(abe),1), (P(charly),1))
> In contrast to the expected behavior, that should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9621) Closure inside RDD doesn't properly close over environment

2020-04-14 Thread Guillaume Martres (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083639#comment-17083639
 ] 

Guillaume Martres commented on SPARK-9621:
--

> unfixed as of scala 2.12.10 / 2.13.1

... but fixed in Dotty, so it's really not a great idea to rely on that.

> Closure inside RDD doesn't properly close over environment
> --
>
> Key: SPARK-9621
> URL: https://issues.apache.org/jira/browse/SPARK-9621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Ubuntu 15.04, spark-1.4.1-bin-hadoop2.6 package
>Reporter: Joe Near
>Priority: Major
>
> I expect the following:
> case class MyTest(i: Int)
> val tv = MyTest(1)
> val res = sc.parallelize(Array((t: MyTest) => t == tv)).first()(tv)
> to be "true." It is "false," when I type this into spark-shell. It seems the 
> closure is changed somehow when it's serialized and deserialized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31256) Dropna doesn't work for struct columns

2020-04-14 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083659#comment-17083659
 ] 

Sunitha Kambhampati commented on SPARK-31256:
-

I can repro the issue using the Scala api on trunk. 

It looks like SPARK-30065 explicitly removed the support for nested column 
resolution in drop.    The change went into trunk and as well as the 2.4.5 
branch.

E.g in the test:

https://github.com/apache/spark/blame/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala#L302

cc [~cloud_fan], [~imback82]    This seems to be a regression. Is there a 
reason to remove this behavior?
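
For reference, a rough Scala sketch of such a repro (mirroring the example in the 
description; not the exact code from the test suite):

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (5, Some(80), Some("Alice")),
  (10, None, Some("Bob")),
  (15, Some(80), None)
).toDF("age", "height", "name")

val dfWithStruct = df.withColumn("struct_col", struct($"age", $"height", $"name"))

// On trunk this drops every row instead of only the row where struct_col.name is null.
dfWithStruct.na.drop(Seq("struct_col.name")).show(false)
{code}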

 

> Dropna doesn't work for struct columns
> --
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
>Reporter: Michael Souder
>Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, 
> None)], schema=['age', 'height', 'name'])
> df.show()
> +---+--+-+
> |age|height| name|
> +---+--+-+
> |  5|80|Alice|
> | 10|  null|  Bob|
> | 15|80| null|
> +---+--+-+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+--+-+
> |age|height| name|
> +---+--+-+
> |  5|80|Alice|
> | 10|  null|  Bob|
> +---+--+-+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 
> 'name'))
> df_with_struct.show(truncate=False)
> +---+--+-+--+
> |age|height|name |struct_col|
> +---+--+-+--+
> |5  |80|Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]|
> |15 |80|null |[15, 80,] |
> +---+--+-+--+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+--++--+
> |age|height|name|struct_col|
> +---+--++--+
> +---+--++--+
> {code}
>  I've tested the above code in Spark 2.4.4 with python 3.7.4 and Spark 2.3.1 
> with python 3.6.8 and in both, the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+--+-+--+
> |age|height|name |struct_col|
> +---+--+-+--+
> |5  |80|Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]|
> +---+--+-+--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31256) Dropna doesn't work for struct columns

2020-04-14 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083666#comment-17083666
 ] 

Terry Kim commented on SPARK-31256:
---

Let me look into this.

> Dropna doesn't work for struct columns
> --
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
>Reporter: Michael Souder
>Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, 
> None)], schema=['age', 'height', 'name'])
> df.show()
> +---+--+-+
> |age|height| name|
> +---+--+-+
> |  5|80|Alice|
> | 10|  null|  Bob|
> | 15|80| null|
> +---+--+-+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+--+-+
> |age|height| name|
> +---+--+-+
> |  5|80|Alice|
> | 10|  null|  Bob|
> +---+--+-+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 
> 'name'))
> df_with_struct.show(truncate=False)
> +---+--+-+--+
> |age|height|name |struct_col|
> +---+--+-+--+
> |5  |80|Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]|
> |15 |80|null |[15, 80,] |
> +---+--+-+--+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+--++--+
> |age|height|name|struct_col|
> +---+--++--+
> +---+--++--+
> {code}
>  I've tested the above code in Spark 2.4.4 with python 3.7.4 and Spark 2.3.1 
> with python 3.6.8 and in both, the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+--+-+--+
> |age|height|name |struct_col|
> +---+--+-+--+
> |5  |80|Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]|
> +---+--+-+--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31447) DATE_PART functions produces incorrect result

2020-04-14 Thread Sathyaprakash Govindasamy (Jira)
Sathyaprakash Govindasamy created SPARK-31447:
-

 Summary: DATE_PART functions produces incorrect result
 Key: SPARK-31447
 URL: https://issues.apache.org/jira/browse/SPARK-31447
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3
Reporter: Sathyaprakash Govindasamy


Spark does not extract the correct date part from a calendar interval. Below is one 
example, extracting the day from a calendar interval:
{code:java}
spark.sql("SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - 
cast('2020-01-01 00:00:00' as timestamp)))").show{code}
++
|date_part('DAY', subtracttimestamps(CAST('2020-01-15 00:00:00' AS TIMESTAMP), 
CAST('2020-01-01 00:00:00' AS TIMESTAMP)))|
++
| 0|
++

Actual output: 0 days
Correct output: 14 days

This is because the SubtractTimestamps expression calculates the difference and 
populates only the microseconds field; the months and days fields are set to zero:
{code:java}
new CalendarInterval(months=0, days=0, microseconds=end.asInstanceOf[Long] - start.asInstanceOf[Long])
{code}

https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2211

But the ExtractIntervalDays expression retrieves the days information from the 
days field of CalendarInterval and therefore returns zero:
{code:java}
def getDays(interval: CalendarInterval): Int = {
  interval.days
}
{code}
https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala#L73



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31447) DATE_PART functions produces incorrect result

2020-04-14 Thread Sathyaprakash Govindasamy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083672#comment-17083672
 ] 

Sathyaprakash Govindasamy commented on SPARK-31447:
---

I am proposing that SubtractTimestamps populate the months and days fields of the 
CalendarInterval as well. Also, in ExtractIntervalDays, when getting the number of 
days, we need to read the months property as well to calculate the total number of 
days.

For the calendar interval _new CalendarInterval(months=2, days=4, 
microseconds=6000)_, the getDays function in IntervalUtils should return ((2 * 
MONTHS_PER_DAY) + 4) instead of 4.

I will raise a PR for this.
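
A rough sketch of the proposed direction (illustrative only; the conversion factor 
and names below are assumptions for the sketch, not existing Spark code):

{code:scala}
import org.apache.spark.unsafe.types.CalendarInterval

// Assumed days-per-month factor, used only for this sketch; the real change would
// need to settle on how months are converted to days.
val ASSUMED_DAYS_PER_MONTH = 31

def getDaysProposed(interval: CalendarInterval): Int =
  interval.months * ASSUMED_DAYS_PER_MONTH + interval.days

// For new CalendarInterval(2, 4, 6000L) this returns 2 * 31 + 4 = 66 instead of 4.
{code}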

> DATE_PART functions produces incorrect result
> -
>
> Key: SPARK-31447
> URL: https://issues.apache.org/jira/browse/SPARK-31447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3
>Reporter: Sathyaprakash Govindasamy
>Priority: Major
>
> Spark does not extract the correct date part from a calendar interval. Below is one 
> example, extracting the day from a calendar interval:
> {code:java}
> spark.sql("SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) 
> - cast('2020-01-01 00:00:00' as timestamp)))").show{code}
> ++
> |date_part('DAY', subtracttimestamps(CAST('2020-01-15 00:00:00' AS 
> TIMESTAMP), CAST('2020-01-01 00:00:00' AS TIMESTAMP)))|
> ++
> | 0|
> ++
> Actual output: 0 days
> Correct output: 14 days
> This is because the SubtractTimestamps expression calculates the difference and 
> populates only the microseconds field; the months and days fields are set to zero
> {code:java}
> new CalendarInterval(months=0, days=0, microseconds=end.asInstanceOf[Long] - 
> start.asInstanceOf[Long]){code}
> https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2211
> But the ExtractIntervalDays expression retrieves the days information from the 
> days field of CalendarInterval and therefore returns zero.
> {code:java}
> def getDays(interval: CalendarInterval): Int = {
>  interval.days
>  }{code}
> https://github.com/apache/spark/blob/2c5d489679ba3814973680d65853877664bcd931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala#L73



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31186) toPandas fails on simple query (collect() works)

2020-04-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31186:
-
Fix Version/s: 2.4.6

> toPandas fails on simple query (collect() works)
> 
>
> Key: SPARK-31186
> URL: https://issues.apache.org/jira/browse/SPARK-31186
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Michael Chirico
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.0.0, 2.4.6
>
>
> My pandas is 0.25.1.
> I ran the following simple code (cross joins are enabled):
> {code:python}
> spark.sql('''
> select t1.*, t2.* from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').toPandas()
> {code}
> and got a ValueError from pandas:
> > ValueError: The truth value of a Series is ambiguous. Use a.empty, 
> > a.bool(), a.item(), a.any() or a.all().
> Collect works fine:
> {code:python}
> spark.sql('''
> select * from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').collect()
> # [Row(v=1, v=1),
> #  Row(v=1, v=2),
> #  Row(v=1, v=3),
> #  Row(v=2, v=1),
> #  Row(v=2, v=2),
> #  Row(v=2, v=3),
> #  Row(v=3, v=1),
> #  Row(v=3, v=2),
> #  Row(v=3, v=3)]
> {code}
> I imagine it's related to the duplicate column names, but this doesn't fail:
> {code:python}
> spark.sql("select 1 v, 1 v").toPandas()
> # v   v
> # 0   1   1
> {code}
> Also no issue for multiple rows:
> spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
> It also works when not using a cross join but a janky 
> programmatically-generated union all query:
> {code:python}
> cond = []
> for ii in range(3):
> for jj in range(3):
> cond.append(f'select {ii+1} v, {jj+1} v')
> spark.sql(' union all '.join(cond)).toPandas()
> {code}
> As near as I can tell, the output is identical to the explode output, making 
> this issue all the more peculiar, as I thought toPandas() is applied to the 
> output of collect(), so if collect() gives the same output, how can 
> toPandas() fail in one case and not the other? Further, the lazy DataFrame is 
> the same: DataFrame[v: int, v: int] in both cases. I must be missing 
> something.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083822#comment-17083822
 ] 

Wenchen Fan commented on SPARK-31423:
-

The ORC file format spec doesn't specify the calendar, but the ORC library 
(reader and writer) uses java 7 Date/Timestamp which indicates the hybrid 
Julian calendar.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083828#comment-17083828
 ] 

Wenchen Fan commented on SPARK-31423:
-

We probably have a bug in picking the next valid date, but the behavior 
itself is expected: Spark always picks the next valid date
{code}
scala> sql("select date'1990-9-31'").show
+-+
|DATE '1990-10-01'|
+-+
|   1990-10-01|
+-+
{code}

It's a bit weird that this date is valid in Spark but invalid in ORC because 
the calendars are different. I'm OK with failing in this case, with a config that 
allows users to write the date anyway by picking the next valid date.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31443) Perf regression of toJavaDate

2020-04-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31443:
---

Assignee: Maxim Gekk

> Perf regression of toJavaDate
> -
>
> Key: SPARK-31443
> URL: https://issues.apache.org/jira/browse/SPARK-31443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27]
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 
> 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> From java.sql.Date  559603
>   38  8.9 111.8   1.0X
> Collect dates  2306   3221
> 1558  2.2 461.1   0.2X
> {code}
> Current master:
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 
> 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> From java.sql.Date 1052   1130
>   73  4.8 210.3   1.0X
> Collect dates  3251   4943
> 1624  1.5 650.2   0.3X
> {code}
> If we subtract preparing DATE column:
> * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row
> * master is (650.2 - 210.3) = 439 ns/row
> The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 
> 349.3)/349.3 = 25%



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31443) Perf regression of toJavaDate

2020-04-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31443.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28212
[https://github.com/apache/spark/pull/28212]

> Perf regression of toJavaDate
> -
>
> Key: SPARK-31443
> URL: https://issues.apache.org/jira/browse/SPARK-31443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> DateTimeBenchmark shows the regression
> Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27]
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 
> 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> From java.sql.Date  559603
>   38  8.9 111.8   1.0X
> Collect dates  2306   3221
> 1558  2.2 461.1   0.2X
> {code}
> Current master:
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 
> 4.15.0-1063-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> To/from Java's date-time: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> From java.sql.Date 1052   1130
>   73  4.8 210.3   1.0X
> Collect dates  3251   4943
> 1624  1.5 650.2   0.3X
> {code}
> If we subtract preparing DATE column:
> * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row
> * master is (650.2 - 210.3) = 439 ns/row
> The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 
> 349.3)/349.3 = 25%



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes

2020-04-14 Thread Abhishek Dixit (Jira)
Abhishek Dixit created SPARK-31448:
--

 Summary: Difference in Storage Levels used in cache() and 
persist() for pyspark dataframes
 Key: SPARK-31448
 URL: https://issues.apache.org/jira/browse/SPARK-31448
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Abhishek Dixit


There is a difference in the default storage level *MEMORY_AND_DISK* between 
pyspark and scala.

*Scala*: StorageLevel(true, true, false, true)

*Pyspark:* StorageLevel(True, True, False, False)

 

*Problem Description:* 

Calling *df.cache()* for a pyspark dataframe directly invokes the Scala method 
cache(), and the storage level used is StorageLevel(true, true, false, true).

But calling *df.persist()* for a pyspark dataframe sets 
newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and then 
invokes the Scala function persist(newStorageLevel).

*Possible Fix:*
Invoke the pyspark persist function inside the pyspark cache function instead of 
calling the Scala function directly.

I can raise a PR for this fix if someone can confirm that this is a bug and that 
the possible fix is the correct approach.
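
For comparison, a small Scala-side sketch (spark-shell) showing what cache() maps to 
on the JVM; this is only an illustration of the difference, not the proposed fix:

{code:scala}
import org.apache.spark.storage.StorageLevel

val df = spark.range(10).toDF("id")
df.cache()

// On the Scala side, cache() is persist(StorageLevel.MEMORY_AND_DISK), i.e.
// StorageLevel(useDisk = true, useMemory = true, useOffHeap = false, deserialized = true).
println(df.storageLevel == StorageLevel.MEMORY_AND_DISK)  // expected: true
{code}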



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2020-04-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083849#comment-17083849
 ] 

Jungtaek Lim commented on SPARK-26385:
--

You probably need to share the entire log messages logged under 
"org.apache.spark.deploy.yarn.security.AMCredentialRenewer" (note: it runs on the 
AM) so that we can confirm whether the renewal is scheduled and executed properly.

It would also be helpful if you could make clear which mode you use: yarn-client 
vs yarn-cluster.

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have Spark Structured Streaming job which is runnning on YARN(Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:3

[jira] [Created] (SPARK-31449) Is there a difference between JDK and Spark's time zone offset calculation

2020-04-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31449:
--

 Summary: Is there a difference between JDK and Spark's time zone 
offset calculation
 Key: SPARK-31449
 URL: https://issues.apache.org/jira/browse/SPARK-31449
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.5
Reporter: Maxim Gekk


Spark 2.4 calculates time zone offsets from wall clock timestamp using 
`DateTimeUtils.getOffsetFromLocalMillis()` (see 
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
{code:scala}
  private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
Long = {
var guess = tz.getRawOffset
// the actual offset should be calculated based on milliseconds in UTC
val offset = tz.getOffset(millisLocal - guess)
if (offset != guess) {
  guess = tz.getOffset(millisLocal - offset)
  if (guess != offset) {
// fallback to do the reverse lookup using java.sql.Timestamp
// this should only happen near the start or end of DST
val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
val year = getYear(days)
val month = getMonth(days)
val day = getDayOfMonth(days)

var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
if (millisOfDay < 0) {
  millisOfDay += MILLIS_PER_DAY.toInt
}
val seconds = (millisOfDay / 1000L).toInt
val hh = seconds / 3600
val mm = seconds / 60 % 60
val ss = seconds % 60
val ms = millisOfDay % 1000
val calendar = Calendar.getInstance(tz)
calendar.set(year, month - 1, day, hh, mm, ss)
calendar.set(Calendar.MILLISECOND, ms)
guess = (millisLocal - calendar.getTimeInMillis()).toInt
  }
}
guess
  }
{code}

Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
{code:java}
if (zone instanceof ZoneInfo) {
((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
} else {
int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
internalGet(ZONE_OFFSET) : 
zone.getRawOffset();
zone.getOffsets(millis - gmtOffset, zoneOffsets);
}
{code}

We need to investigate whether there are any differences in results between the two approaches.
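
As a starting point for the investigation, a small, self-contained Scala probe (not 
Spark code; the zone and timestamp are arbitrary examples) that exercises a 
wall-clock time inside a DST gap with both kinds of APIs:

{code:scala}
import java.time._
import java.util.{Calendar, TimeZone}

// Wall-clock time inside the US spring-forward gap (02:30 does not exist on this date).
val zone = ZoneId.of("America/Los_Angeles")
val wall = LocalDateTime.of(2020, 3, 8, 2, 30)

// java.time view: no valid offset exists for this local time.
println(zone.getRules.getValidOffsets(wall))  // expected: []

// Calendar-based reverse lookup, in the style of getOffsetFromLocalMillis' fallback.
val cal = Calendar.getInstance(TimeZone.getTimeZone(zone))
cal.clear()
cal.set(2020, Calendar.MARCH, 8, 2, 30, 0)
println(Instant.ofEpochMilli(cal.getTimeInMillis))  // which instant does the JDK pick?
{code}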




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29137) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction

2020-04-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083855#comment-17083855
 ] 

Jungtaek Lim commented on SPARK-29137:
--

Still valid on latest master (3.1.0-SNAPSHOT).

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121229


> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction
> --
>
> Key: SPARK-29137
> URL: https://issues.apache.org/jira/browse/SPARK-29137
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/]
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 503, in test_train_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 69, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 498, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29137) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction

2020-04-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083855#comment-17083855
 ] 

Jungtaek Lim edited comment on SPARK-29137 at 4/15/20, 6:58 AM:


Still valid on latest master (3.1.0-SNAPSHOT).

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121229

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121231

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121232



was (Author: kabhwan):
Still valid on latest master (3.1.0-SNAPSHOT).

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121229


> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction
> --
>
> Key: SPARK-29137
> URL: https://issues.apache.org/jira/browse/SPARK-29137
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/]
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 503, in test_train_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 69, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 498, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26646) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2020-04-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083856#comment-17083856
 ] 

Jungtaek Lim commented on SPARK-26646:
--

Looks like this is still happening on the master branch (3.1.0-SNAPSHOT).

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121232

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26646
> URL: https://issues.apache.org/jira/browse/SPARK-26646
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101356/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101358/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101254/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100941/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100327/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition, timeout=60.0)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 69, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 362, in condition
> self.assertGreater(errors[1] - errors[-1], 0.3)
> AssertionError: -0.070062 not greater than 0.3
> --
> Ran 13 tests in 198.327s
> FAILED (failures=1, skipped=1)
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with 
> python3.4; see logs.
> {code}
> It apparently became less flaky after the timeout was increased in SPARK-26275, 
> but now it looks like it has become flaky again due to unexpected results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence

2020-04-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083857#comment-17083857
 ] 

Jungtaek Lim commented on SPARK-29222:
--

Still happening on master (3.1.0-SNAPSHOT).

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121232

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
> ---
>
> Key: SPARK-29222
> URL: https://issues.apache.org/jira/browse/SPARK-29222
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/]
> {code:java}
> Error Message
> 7 != 10
> Stacktrace
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 7 != 10
>{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org