[jira] [Comment Edited] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10

2022-11-07 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17630001#comment-17630001
 ] 

Dustin Smith edited comment on SPARK-40351 at 11/7/22 8:14 PM:
---

[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL (1995); see section 4.3.2 in the link.

Since it exists to optimize decimal operations, I don't think it would be 
possible to override it. This is just my reading of their paper and may be 
incorrect (take it with a grain of salt).

[https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf]


was (Author: dwsmith1983):
[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL; see section 4.3.2 in the link.

Since it exists to optimize decimal operations, I don't think it would be 
possible to override it. This is just my reading of their paper and may be 
incorrect (take it with a grain of salt).

[https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf]

> Spark Sum increases the precision of DecimalType arguments by 10
> 
>
> Key: SPARK-40351
> URL: https://issues.apache.org/jira/browse/SPARK-40351
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 3.2.0
>Reporter: Tymofii
>Priority: Minor
>
> Currently Spark automatically increases the precision of a Decimal field by 
> 10 (a hard-coded value) after a SUM aggregate operation - 
> [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877].
> There are a couple of questions:
>  # Why was 10 chosen as the default?
>  # Does it make sense to allow the user to override this value via 
> configuration? 
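For readers skimming the thread, the widening rule under discussion can be sketched as a small, hypothetical helper (assumptions: the +10 is hard coded as the issue states, and precision is capped at 38, Spark's documented maximum; the function name here is illustrative, not Spark's API):

```python
# Hypothetical sketch of the result type assigned to SUM over a
# DecimalType(precision, scale) column: precision grows by the hard-coded
# 10, capped at the maximum precision of 38. Names are illustrative only.
MAX_PRECISION = 38

def sum_result_type(precision: int, scale: int) -> tuple:
    # +10 digits of headroom allows roughly 10**10 rows to be summed
    # without overflowing the result type
    return (min(precision + 10, MAX_PRECISION), scale)

print(sum_result_type(10, 2))  # a DecimalType(10, 2) column sums to (20, 2)
print(sum_result_type(35, 4))  # already near the cap, so clamped to (38, 4)
```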



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10

2022-11-07 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17630001#comment-17630001
 ] 

Dustin Smith edited comment on SPARK-40351 at 11/7/22 8:14 PM:
---

[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL (2015); see section 4.3.2 in the link.

Since it exists to optimize decimal operations, I don't think it would be 
possible to override it. This is just my reading of their paper and may be 
incorrect (take it with a grain of salt).

[https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf]


was (Author: dwsmith1983):
[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL (1995); see section 4.3.2 in the link.

Since it exists to optimize decimal operations, I don't think it would be 
possible to override it. This is just my reading of their paper and may be 
incorrect (take it with a grain of salt).

[https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf]







[jira] [Comment Edited] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10

2022-11-07 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17630001#comment-17630001
 ] 

Dustin Smith edited comment on SPARK-40351 at 11/7/22 8:07 PM:
---

[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL; see section 4.3.2 in the link.

Since it exists to optimize decimal operations, I don't think it would be 
possible to override it. This is just my reading of their paper and may be 
incorrect (take it with a grain of salt).

[https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf]


was (Author: dwsmith1983):
[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL; see section 4.3.2 in the link.

https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf







[jira] [Commented] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10

2022-11-07 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17630001#comment-17630001
 ] 

Dustin Smith commented on SPARK-40351:
--

[~tkhomichuk] For point 3, I think it is related to the logical optimizations 
Spark SQL applies to sum and aggregation, according to the Databricks paper on 
Spark SQL; see section 4.3.2 in the link.

https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf







[jira] [Commented] (SPARK-40934) pyspark.pandas.read_csv parses dates, but docs state otherwise

2022-11-06 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629542#comment-17629542
 ] 

Dustin Smith commented on SPARK-40934:
--

What should be the desired behavior of this date column when it is not parsed? 
Should it be converted to string type?

> pyspark.pandas.read_csv parses dates, but docs state otherwise
> --
>
> Key: SPARK-40934
> URL: https://issues.apache.org/jira/browse/SPARK-40934
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.3.1
>Reporter: Stefaan Lippens
>Priority: Major
>
> from 
> [https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv.html]
>  :
> {quote}parse_dates:
> boolean or list of ints or names or list of lists or dict, default False.
> Currently only False is allowed.
> {quote}
> This documentation suggests that dates are never parsed, but apparently they 
> are always parsed (and it cannot be disabled):
> {code:python}
> import pyspark.pandas
> df = pyspark.pandas.read_csv("data.csv", parse_dates=False)
> print(df)
> print(df.dtypes)
> {code}
> with this data
> {code:java}
> date,feature_index,band_0,band_1,band_2
> 2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
> 2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
> 2021-01-05T01:00:00.000+01:00,1,5.0,3.5,4.0
> 2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
> 2021-01-15T01:00:00.000+01:00,0,15.0,1.0,2.25
> {code}
> gives
> {code:java}
>                  date  feature_index  band_0  band_1  band_2
> 0 2021-01-05 01:00:00              2     5.0     4.5    3.75
> 1 2021-01-05 01:00:00              0     5.0     1.0    2.25
> 2 2021-01-05 01:00:00              1     5.0     3.5    4.00
> 3 2021-01-15 01:00:00              2    15.0     4.5    3.75
> 4 2021-01-15 01:00:00              0    15.0     1.0    2.25
> date             datetime64[ns]
> feature_index             int32
> band_0                  float64
> band_1                  float64
> band_2                  float64
> dtype: object
> {code}
> Notice how the dates are parsed (e.g.  dtype {{datetime64[ns]}} for {{date}})
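For contrast, plain pandas does honor parse_dates=False, which appears to be the behavior the pandas-on-Spark documentation describes. A minimal check (assuming pandas is installed; the inline CSV below is a shortened, hypothetical version of the data above):

```python
import io
import pandas as pd

csv_text = (
    "date,feature_index,band_0\n"
    "2021-01-05T01:00:00.000+01:00,2,5.0\n"
    "2021-01-15T01:00:00.000+01:00,0,15.0\n"
)

# With parse_dates=False, plain pandas leaves the date column as strings
df = pd.read_csv(io.StringIO(csv_text), parse_dates=False)
print(df["date"].dtype)  # object, i.e. plain strings, not datetime64[ns]
```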






[jira] [Comment Edited] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`

2021-01-11 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263052#comment-17263052
 ] 

Dustin Smith edited comment on SPARK-27981 at 1/12/21, 3:33 AM:


This problem exists on 2.4.6 as well with JDK 11.0.4.


was (Author: dwsmith1983):
This problem exists on 2.4.6 as well.

> Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
> --
>
> Key: SPARK-27981
> URL: https://issues.apache.org/jira/browse/SPARK-27981
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This PR aims to remove the following warnings for `java.nio.Bits.unaligned` 
> at JDK 9/10/11/12. Please note that there are more warnings which are beyond 
> this PR's scope.
> {code}
> bin/spark-shell --driver-java-options=--illegal-access=warn
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar)
>  to method java.nio.Bits.unaligned()
> ...
> {code}
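Until the warning is removed at the source, one way to make this particular access legal on JDK 9+ is to open the package that Platform accesses reflectively. This is a hedged sketch: `--add-opens` is a standard JVM option rather than a Spark one, and the exact set of options a given Spark version needs may differ.

```shell
# Opens java.base/java.nio so org.apache.spark.unsafe.Platform's reflective
# call to java.nio.Bits.unaligned() is permitted without a warning (JDK 9+).
bin/spark-shell --driver-java-options="--add-opens=java.base/java.nio=ALL-UNNAMED"
```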






[jira] [Commented] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`

2021-01-11 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263052#comment-17263052
 ] 

Dustin Smith commented on SPARK-27981:
--

This problem exists on 2.4.6 as well.







[jira] [Created] (SPARK-33929) Spark-submit with --package deequ doesn't pull all jars

2020-12-28 Thread Dustin Smith (Jira)
Dustin Smith created SPARK-33929:


 Summary: Spark-submit with --package deequ doesn't pull all jars
 Key: SPARK-33929
 URL: https://issues.apache.org/jira/browse/SPARK-33929
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0
Reporter: Dustin Smith


This issue was marked as solved in 
[SPARK-24074|https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel];
however, [~hyukjin.kwon] pointed out in the comments that version 2.4.x was 
experiencing this same problem.

This problem exists in the 2.3.x line as well.






[jira] [Updated] (SPARK-33929) Spark-submit with --package deequ doesn't pull all jars

2020-12-28 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-33929:
-
Description: 
This issue was marked as solved in SPARK-24074; however, [~hyukjin.kwon] 
pointed out in the comments that version 2.4.x was experiencing this same 
problem when using Amazon Deequ.

This problem exists in the 2.3.x line as well for Deequ.

  was:
This issue was marked as solved in 
[SPARK-24074|https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel];
however, [~hyukjin.kwon] pointed out in the comments that version 2.4.x was 
experiencing this same problem.

This problem exists in the 2.3.x line as well.


> Spark-submit with --package deequ doesn't pull all jars
> ---
>
> Key: SPARK-33929
> URL: https://issues.apache.org/jira/browse/SPARK-33929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4
>Reporter: Dustin Smith
>Priority: Major
>
> This issue was marked as solved in SPARK-24074; however, [~hyukjin.kwon] 
> pointed out in the comments that version 2.4.x was experiencing this same 
> problem when using Amazon Deequ.
> This problem exists in the 2.3.x line as well for Deequ.






[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-29 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203753#comment-17203753
 ] 

Dustin Smith commented on SPARK-32046:
--

[~planga82] this works every time without issue: 
{{java.time.LocalDateTime.now}}

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache, skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in the shell versus Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}
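One plausible explanation for the freeze (an assumption about Catalyst internals, not something confirmed in this thread) is that an optimizer rule in the spirit of Spark's ComputeCurrentTime captures the clock once per optimization pass and substitutes the same literal for every current_timestamp in the plan, so a cached plan keeps replaying that literal. A toy model in Python:

```python
import time

def optimize(plan, clock=time.time):
    # Toy model: capture the clock ONCE per optimization pass, then
    # substitute that same literal for every current_timestamp node.
    now = clock()
    return [now if node == "current_timestamp" else node for node in plan]

plan = optimize(["current_timestamp", "col_a", "current_timestamp"])
assert plan[0] == plan[2]  # both occurrences froze to the same literal
```

Under this model, each freshly built dataframe should get its own literal; the oddity reported here is that the literal appears to survive across separately constructed plans.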






[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203598#comment-17203598
 ] 

Dustin Smith commented on SPARK-32046:
--

[~planga82] so the fact that the dataframes are named differently doesn't 
constitute a different plan, since they have the same columns? The time should 
be a unique value, but in your example you are using a name column specifying 
the variable names, making them unique.

Just so I understand. For now I will most likely use the Java implementation, 
since it doesn't introduce added columns.







[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203043#comment-17203043
 ] 

Dustin Smith edited comment on SPARK-32046 at 9/28/20, 7:12 AM:


[~planga82] the use case is related to a working environment. We need to do 
some different processing that can finish at different times. However, we need 
to record the times and merge all the dataframes. We cached the dataframes to 
freeze the time so that the next time the data is called the time wouldn't 
change.

In my toy problem, I just created df1-df3. In this example, I would want each 
to have the time at which it was cached, so 3 separate times. I do have a 
workaround using the Java implementation, but this Spark implementation just 
seems odd to me.


was (Author: dustin.smith.tdg):
[~planga82] the use case is related to a working environment. We need to do 
some different processing that can finish at different times. However, we need 
to record the times and merge all the dataframes. We cached the dataframes to 
freeze the time so that the next time the data is called the time wouldn't 
change.







[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203043#comment-17203043
 ] 

Dustin Smith commented on SPARK-32046:
--

[~planga82] the use case is related to a working environment. We need to do 
some different processing that can finish at different times. However, we need 
to record the times and merge all the dataframes. We cached the dataframes to 
freeze the time so that the next time the data is called the time wouldn't 
change.







[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-14 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177584#comment-17177584
 ] 

Dustin Smith commented on SPARK-32046:
--

[~maropu] yes, one would expect the current timestamp to be applied per 
dataframe for each call; however, this isn't the case, and my Jira ticket is 
about the fact that all dataframes get the same timestamp. Once 
current_timestamp has been called once, that is the time. That is, even if we 
have a new dataframe with a new execution query and a new query plan, the time 
will be that of the first call in the shell. On Jupyter and ZP, it will 
increment twice before freezing.







[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-13 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177401#comment-17177401
 ] 

Dustin Smith edited comment on SPARK-32046 at 8/14/20, 12:51 AM:
-

[~maropu] the definition of current timestamp is as follows:
{code:java}
current_timestamp() - Returns the current timestamp at the start of query 
evaluation.
{code}
The question is: when does a query evaluation start and stop? Do mutually 
exclusive dataframes being processed consist of the same query evaluation? If 
yes, then current_timestamp's behavior in the spark shell is correct; however, 
as a user, I would find that behavior extremely undesirable. I would rather 
cache the current timestamp and call it again for a new time.

Now if a query evaluation stops once it is executed and starts anew when 
another dataframe or action is called, then the behavior in the shell and 
notebooks is incorrect. The notebooks are only correct for a few runs and then 
stop changing.

[https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp]

Additionally, whichever behavior is correct or should be correct is not 
applied consistently, and more robust testing should occur in my opinion.

As an afterthought, the name current timestamp doesn't make sense if the time 
is supposed to freeze after one call. Really it is the current timestamp once; 
beyond that call it is no longer current.


was (Author: dustin.smith.tdg):
[~maropu] the definition of current timestamp is as follows:
{code:java}
current_timestamp() - Returns the current timestamp at the start of query 
evaluation.
{code}
The question is: when does a query evaluation start and stop? Do mutually 
exclusive dataframes being processed consist of the same query evaluation? If 
yes, then current_timestamp's behavior in the spark shell is correct; however, 
as a user, I would find that behavior extremely undesirable. I would rather 
cache the current timestamp and call it again for a new time.

Now if a query evaluation stops once it is executed and starts anew when 
another dataframe or action is called, then the behavior in the shell and 
notebooks is incorrect. The notebooks are only correct for a few runs and then 
stop changing.

[https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp]

Additionally, whichever behavior is correct or should be correct is not 
applied consistently, and more robust testing should occur in my opinion.

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in the shell versus Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-13 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177401#comment-17177401
 ] 

Dustin Smith edited comment on SPARK-32046 at 8/14/20, 12:49 AM:
-

[~maropu] the definition of current_timestamp is as follows:
{code:java}
current_timestamp() - Returns the current timestamp at the start of query 
evaluation.
{code}
The question is: when does a query evaluation start and stop? Do mutually exclusive 
dataframes being processed belong to the same query evaluation? If yes, then 
current_timestamp's behavior in the Spark shell is correct; however, as a user, I 
would find that extremely undesirable behavior. I would rather cache the current 
timestamp and call it again for a new time. 

Now if a query evaluation stops once it is executed and starts anew when 
another dataframe or action is called, then the behavior in the shell and notebooks 
is incorrect. The notebooks are only correct for a few runs and then stop 
changing.

[https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp]

Additionally, whichever behavior is correct or should be correct, the current 
behavior is not consistent, and in my opinion more robust testing should occur.


was (Author: dustin.smith.tdg):
[~maropu] the definition of current_timestamp is as follows:
{code:java}
current_timestamp() - Returns the current timestamp at the start of query 
evaluation.
{code}
The question is: when does a query evaluation start and stop? Do mutually exclusive 
dataframes being processed belong to the same query evaluation? If yes, then 
current_timestamp's behavior in the Spark shell is correct; however, as a user, I 
would find that extremely undesirable behavior. I would rather cache the current 
timestamp and call it again for a new time. 

Now if a query evaluation stops once it is executed and starts anew when 
another dataframe or action is called, then the behavior in the shell and notebooks 
is incorrect. The notebooks are only correct for a few runs and then stop 
changing.

[https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp]







[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-13 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177401#comment-17177401
 ] 

Dustin Smith commented on SPARK-32046:
--

[~maropu] the definition of current_timestamp is as follows:
{code:java}
current_timestamp() - Returns the current timestamp at the start of query 
evaluation.
{code}
The question is: when does a query evaluation start and stop? Do mutually exclusive 
dataframes being processed belong to the same query evaluation? If yes, then 
current_timestamp's behavior in the Spark shell is correct; however, as a user, I 
would find that extremely undesirable behavior. I would rather cache the current 
timestamp and call it again for a new time. 

Now if a query evaluation stops once it is executed and starts anew when 
another dataframe or action is called, then the behavior in the shell and notebooks 
is incorrect. The notebooks are only correct for a few runs and then stop 
changing.

[https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp]







[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-09 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Affects Version/s: 3.0.0







[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-30 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Description: 
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframe's time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond 
(when running on Zeppelin or Jupyter).

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.
// then later cast to TimestampType{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in the shell versus Jupyter or Zeppelin (ZP) also produces different 
results. In the shell, you only get 1 unique time no matter how many times you 
run current_timestamp. However, in ZP or Jupyter I have always received 2 
unique times before it froze.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}

  was:
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.
// then later cast to TimestampType{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in shell and Jupyter or Zeppelin (ZP) also produces different 
results. In the shell, you only get 1 unique time no matter how many times you 
run it, current_timestamp. However, in ZP or Jupyter I have always received 2 
unique times before it froze.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}


> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}






[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-29 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Description: 
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.
// then later cast to TimestampType{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in shell and Jupyter or Zeppelin (ZP) also produces different 
results. In the shell, you only get 1 unique time no matter how many times you 
run it, current_timestamp. However, in ZP or Jupyter I have always received 2 
unique times before it froze.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}

  was:
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.
// then later cast to TimestampType{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in shell and Jupyter or Zeppelin also produces different 
results. In the shell, you only get 1 time all 3 times.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}


> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
> 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and 
> the 2nd will differ in time but will become static on the 3rd usage and 
> beyond.
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}






[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-29 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Labels: caching sql time  (was: )

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
> 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and 
> the 2nd will differ in time but will become static on the 3rd usage and 
> beyond.
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin also produces different 
> results. In the shell, you only get 1 time all 3 times.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}






[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-22 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Description: 
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.
// then later cast to TimestampType{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in shell and Jupyter or Zeppelin also produces different 
results. In the shell, you only get 1 time all 3 times.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}

  was:
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in shell and Jupyter or Zeppelin also produces different 
results. In the shell, you only get 1 time all 3 times.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}


> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Dustin Smith
>Priority: Minor
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
> 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and 
> the 2nd will differ in time but will become static on the 3rd usage and 
> beyond.
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin also produces different 
> results. In the shell, you only get 1 time all 3 times.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}






[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-22 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Description: 
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However,
{code:java}
val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
df.count

// this can be run 3 times no issue.{code}
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

Running the code in shell and Jupyter or Zeppelin also produces different 
results. In the shell, you only get 1 time all 3 times.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}

  was:
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and the 
2nd will differ in time but will become static on the 3rd usage and beyond.

Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
However, `Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache` 
doesn't have this problem and all 3 dataframes cache with correct times 
displaying.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}


> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Dustin Smith
>Priority: Minor
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframes time, the 3rd dataframe time and beyond (4th, 
> 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe and 
> the 2nd will differ in time but will become static on the 3rd usage and 
> beyond.
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin also produces different 
> results. In the shell, you only get 1 time all 3 times.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}






[jira] [Updated] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-22 Thread Dustin Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Smith updated SPARK-32046:
-
Description: 
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframe's time, the 3rd dataframe's time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st and 2nd dataframes 
will differ in time but become static on the 3rd usage and beyond.

Additionally, caching only causes 2 of the dataframes to be cached, skipping the 
3rd. However, `Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache` 
doesn't have this problem: all 3 dataframes cache with the correct times 
displayed.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}
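If the goal is simply to give each dataframe its own frozen time, one possible workaround is to capture the timestamp eagerly on the driver instead of relying on current_timestamp, which appears to be resolved once and then shared via the cached plan. A minimal, untested sketch, assuming a SparkSession named {{spark}} is in scope (as in spark-shell):

```scala
// Sketch of a possible workaround (assumes a SparkSession `spark` in scope).
// Capture the time on the driver as a plain value and embed it as a literal,
// so each dataframe carries its own fixed timestamp by construction rather
// than depending on when current_timestamp is substituted into the plan.
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit

val now = new Timestamp(System.currentTimeMillis) // evaluated once, immediately
val df1 = spark.range(1).select(lit(now) as "datetime").cache
df1.count // materializes the cache with the captured time
```

Because each dataframe is built from a distinct literal value, the analyzed plans differ and should not collide in the cache the way identical current_timestamp plans can.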

  was:
If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframe's time, the 3rd dataframe's time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st and 2nd dataframes 
will differ in time but become static on the 3rd usage and beyond.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}


> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Dustin Smith
>Priority: Minor
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe's time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st and 2nd 
> dataframes will differ in time but become static on the 3rd usage and 
> beyond.
> Additionally, caching only causes 2 of the dataframes to be cached, skipping 
> the 3rd. However, `Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache` 
> doesn't have this problem: all 3 dataframes cache with the correct times 
> displayed.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}






[jira] [Created] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-06-22 Thread Dustin Smith (Jira)
Dustin Smith created SPARK-32046:


 Summary: current_timestamp called in a cache dataframe freezes the 
time for all future calls
 Key: SPARK-32046
 URL: https://issues.apache.org/jira/browse/SPARK-32046
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.0
Reporter: Dustin Smith


If I call current_timestamp 3 times while caching the dataframe variable in 
order to freeze that dataframe's time, the 3rd dataframe's time and beyond (4th, 
5th, ...) will be frozen to the 2nd dataframe's time. The 1st and 2nd dataframes 
will differ in time but become static on the 3rd usage and beyond.

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as "datetime").cache
df1.count

df1.show(false)

Thread.sleep(9500)

val df2 = spark.range(1).select(current_timestamp as "datetime").cache
df2.count 

df2.show(false)

Thread.sleep(9500)

val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
df3.count 

df3.show(false){code}


