[jira] [Created] (SPARK-31669) RowEncoderSuite.encode/decode fails on 1000-02-29
Maxim Gekk created SPARK-31669: -- Summary: RowEncoderSuite.encode/decode fails on 1000-02-29 Key: SPARK-31669 URL: https://issues.apache.org/jira/browse/SPARK-31669 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Here is the failure https://github.com/apache/spark/pull/28481#issuecomment-626034381: {code} org.scalatest.exceptions.TestFailedException: schema: struct input: [null,true,127,-32768,684257610,3148440411416190456,Infinity,null,2.036236359763072870,ം뵡所碆ᚯ᧳ꒁ밯झ᧱휽⑲岫遳翁㎊륣䓵읹씶읽Â␣⪸붵끂꩖⟭䶄裻乌⇇깍뵙偁뷩셙녶퐾귘嫫䍧쩔ꆁ䠾ՠ訣췐つ亙⚓깠긄蚣꿞묌泓㘡ⵆ橾櫻膋뿽⮎㖍杘䊣臼穇붘켑镅抎灕쿿ァ쏍㤰酀旬槳鑻槸놛턌춅ꉪ陪⡉法耸郄篍㹏吡ط汢측䱣 莶婚ⳟ슿쓻̷흖〦湶ဎ銓霁叹롄ᯕ珅䅃卩慗銁묠쯟ሄ啕澻矌軈憃䑋餤I쒚ᡭ⪩⚋湐蒒ジ䝱綅媪㍉芸礮猱耳藁笲⽽壶젅溜穸⫾룚྿뇳Ѩ䍢넪谦⎠줊넳楼橨䖊ꪗꚔ鬜⋍羯ሾ삦毜뢍⛛᭟莽糸픣좖뮋撜혍牭ӎ뢂験ꆪᩉ跙㌌ᔸꦐ〷旽k텁ଘ쩧媉❛뛽뷺㱂᪭挃ቿ셾⁞邞郰홋쀘ᜍ뉿ഁ迭梽ዳ硟崤쑼놱뎬蓬覄挗뾱뉍枈懂⼞ܭ갸첟ᢍ燃Ò䦛∫㦿ᶡ랗ђ䓸쑾퀷ၓ鍖霃솄⨗얔嫚ꨵ캁큰ߢ䌵Ⓡ扛郾ꌟ䫀㑈瓺냾厌ᇗ玹띏푏㻛䰁ᤠ邰굇뷉恃ᦜ쾖戀諕돚裹聼鬽劙Ἱ䏐烗䢭뉁ꏼⴾ欆⛺坶磩̿꽦⾩綬跩玉谩嶂퇗떾心鵈짘쒸봐傱䦂殏┗ח듅宥㣠ꘙ㽟忽ଟ겚鄀梧ж䋁癫剫㠉繮੫ݽ櫌非剖䤖噹뫒圏쬧罍氒ញ梶印䶋蝗杨윇鬑䰡笤㜇梀큦먚碈蝠⒊쩔蹂ૡወ쵩襒ᇳ擴ꓙ踘짧㤫倍趯鱘剨궐ঔ⇮ᶄಂ꼉⨛插柬ᒠ뇯뉒Ⱛ돝ヘ枀冗ꈑ筚綃놪㞴滅䷀ቿ䋃絚孝⏍ɞԃ灚诔懠卮쬸υ뇺闭䆲葞颫頴渋皒夂Ⲧ蟹폊綘ꥄ悈匢觏奴둇⺮웧쭑析윘ⴉ㯒罧䔫妬滢顂⺀ᶠ洷㈋祵鼲꿓阤煳⪧耒襠Ὀ蒣尥鴴涜⭕넳ⶬꑷ㢭憾휦蘀暫줔䐳Ӏ膲뜊꾓휔⤻染肽㉟Ὲ돋⏦턝⨋噴䡧☘蟾違숶籩헺Ꮼ͵ळ⣣ੲ憋ጴ癤Ś泣ࢨ뎜뚗꽳텭⽊ꦞ⍝臬슯챑捒ᐑ薯闌巡猰恝ᘱ퇨倶掫ύ矞㹿䱟᭵ජꞥ푥✠儦慮齵ệ艝傫⠤⾿챔លͬ츂궄裐편ἵ핗곐촂Ѷ鋟ໄ櫷諩艄掽᧡輎ଇ颁굺㒔企鲺脞稯흂휾ꆲ駊㲹恾暤ź沥咺ଅᖣ嶀㱎쐢꼕㮚ⴞĒ䯭튔㹶ꯜꇙ廦㏚颿垌빫ࠣ悰흥꧆괱鈋暶ᭇ燙㐇뿜閆䩋쾽䉄ៈᵅ칇ચ厑济갺캜㤩봫껫衴㎱롺藪夞䃡㮛픳餣최껐ꮾꃼ友Ῡ磗༩ꡐ흏崋䰖牀㨊䞋ᓊ㺧ᔣꥱ룛ᚁ爥呯ᩮၥ㴳㗀籧鮶噿浦ٰ癝牻⬬䷗㽂醙ꨞੇ굾鏬⑰酚곥ℰ菁εⓤ嶐媒帊녲湙犉ܒ啹⾧孨䜸錸ஊ쐡ᾫ㊮夒䇏繍힂ᡗ奄輽섚肫쀺왗隬㨖ⲝⵙ껽狇貥෫孒톶鄜趿滃逅ꨎ䫻箚美뮣湾梠贉遚줐㞻䴳떛齿楂ᣀ䟯再ꨬ驂䉭ꇜ,[B@3f1a4861,1970-01-01,1000-02-29 10:11:12.123,null] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31665) Test parquet dictionary encoding of random dates/timestamps
Maxim Gekk created SPARK-31665: -- Summary: Test parquet dictionary encoding of random dates/timestamps Key: SPARK-31665 URL: https://issues.apache.org/jira/browse/SPARK-31665 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, dictionary encoding is not tested in the ParquetHadoopFsRelationSuite test "test all data types" because the generated dates and timestamps are uniformly distributed, so dictionary encoding is in fact never applied to those types. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
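A minimal sketch of how such a test could force dictionary encoding (spark-shell style; the value pool, row count, and output path are illustrative, not the suite's actual code): sampling dates from a few distinct values makes the Parquet writer emit a dictionary page, unlike the uniformly distributed values generated today.
{code:scala}
// Draw random dates from a small pool of distinct values instead of a
// uniform distribution, so the Parquet writer builds a dictionary page.
import java.sql.Date
import scala.util.Random

val pool = Seq("1000-01-01", "1582-10-15", "1970-01-01", "2020-05-01").map(Date.valueOf)
val dates = Seq.fill(1000)(pool(Random.nextInt(pool.length)))
dates.toDF("date")
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/tmp/parquet-date-dict-test")
{code}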
[jira] [Updated] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files
[ https://issues.apache.org/jira/browse/SPARK-31662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31662: --- Description: Write dates with dictionary encoding enabled to parquet files: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> :paste // Entering paste mode (ctrl-D to finish) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")) .repartition(1) .write .option("parquet.enable.dictionary", true) .mode("overwrite") .parquet("/Users/maximgekk/tmp/parquet-date-dict") // Exiting paste mode, now interpreting. {code} Read them back: {code:scala} scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false) +--+ |date | +--+ |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| +--+ {code} *Expected values must be 1001-01-01.* I checked that the date column is encoded by dictionary via: {code} ➜ parquet-date-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump ./part-0-84a77214-0c8c-45e9-ac41-5ca863b9dd94-c000.snappy.parquet row group 0 date: INT32 SNAPPY DO:0 FPO:4 SZ:74/70/0.95 VC:8 ENC:BIT_PACKED,RLE,P [more]... date TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:8 INT32 date *** row group 1 of 1, values 1 to 8 *** value 1: R:0 D:1 V:1001-01-07 value 2: R:0 D:1 V:1001-01-07 value 3: R:0 D:1 V:1001-01-07 value 4: R:0 D:1 V:1001-01-07 value 5: R:0 D:1 V:1001-01-07 value 6: R:0 D:1 V:1001-01-07 value 7: R:0 D:1 V:1001-01-07 value 8: R:0 D:1 V:1001-01-07 {code} was: Write dates with dictionary encoding enabled to parquet files: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> :paste // Entering paste mode (ctrl-D to finish) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")) .repartition(1) .write .option("parquet.enable.dictionary", true) .mode("overwrite") .parquet("/Users/maximgekk/tmp/parquet-date-dict") // Exiting paste mode, now interpreting. {code} Read them back: {code:scala} scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false) +--+ |date | +--+ |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| +--+ {code} *Expected values must be 1001-01-01.* > Reading wrong dates from dictionary encoded columns in Parquet files > > > Key: SPARK-31662 > URL: https://issues.apache.org/jira/browse/SPARK-31662 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates with dictionary encoding enabled to parquet files: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated.
> Type :help for more information. > scala> > spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) > scala> :paste > // Entering paste mode (ctrl-D to finish) > Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") > .select($"dateS".cast("date").as("date")) > .repartition(1) > .write > .option("parquet.enable.dictionary", true) > .mode("overwrite") > .parquet("/Users/maximgekk/tmp/parquet-date-dict") > // Exiting paste mode, now interpreting. > {code} > Read them back: >
[jira] [Created] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files
Maxim Gekk created SPARK-31662: -- Summary: Reading wrong dates from dictionary encoded columns in Parquet files Key: SPARK-31662 URL: https://issues.apache.org/jira/browse/SPARK-31662 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Write dates with dictionary encoding enabled to parquet files: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> :paste // Entering paste mode (ctrl-D to finish) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")) .repartition(1) .write .option("parquet.enable.dictionary", true) .mode("overwrite") .parquet("/Users/maximgekk/tmp/parquet-date-dict") // Exiting paste mode, now interpreting. {code} Read them back: {code:scala} scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false) +--+ |date | +--+ |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| +--+ {code} *Expected values must be 1001-01-01.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31641) Incorrect days conversion by JSON legacy parser
Maxim Gekk created SPARK-31641: -- Summary: Incorrect days conversion by JSON legacy parser Key: SPARK-31641 URL: https://issues.apache.org/jira/browse/SPARK-31641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Spark 2.4.5: {code:scala} scala> val ds = Seq("{'d': '-141704'}").toDS ds: org.apache.spark.sql.Dataset[String] = [value: string] scala> val json = spark.read.schema("d date").json(ds) json: org.apache.spark.sql.DataFrame = [d: date] scala> json.show +--+ | d| +--+ |1582-01-01| +--+ {code} Spark 3.1.0-SNAPSHOT: {code:scala} scala> val ds = Seq("{'d': '-141704'}").toDS ds: org.apache.spark.sql.Dataset[String] = [value: string] scala> val json = spark.read.schema("d date").json(ds) json: org.apache.spark.sql.DataFrame = [d: date] scala> json.show +--+ | d| +--+ |1582-01-11| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099579#comment-17099579 ] Maxim Gekk commented on SPARK-31579: [~suddhuASF] Replacing floorDiv by / is trivial. Please first write code which proves that, and post it here in a comment. /cc [~cloud_fan] [~hyukjin.kwon] The code should go over all available time zones with a step of 1 hour plus a jitter of a few minutes. > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31630) Skip timestamp rebasing after 1900-01-01
Maxim Gekk created SPARK-31630: -- Summary: Skip timestamp rebasing after 1900-01-01 Key: SPARK-31630 URL: https://issues.apache.org/jira/browse/SPARK-31630 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk The conversions of Catalyst's DATE/TIMESTAMP values to/from Java's types java.sql.Date/java.sql.Timestamp have almost the same implementation except for an additional rebasing op. If we look at the switch and diff arrays of all available time zones, we can detect that there is a time point after which all diffs are 0. This is 1900-01-01 00:00:00Z. So, we can compare input micros with that time point and skip the conversion for modern timestamps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
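A sketch of the proposed fast path (the constant and helper names below are illustrative, not Spark's actual internals):
{code:scala}
// 1900-01-01T00:00:00Z in microseconds since the epoch; per the description,
// all per-zone diffs are 0 at and after this point.
val lastSwitchTs: Long = -2208988800000000L

// Placeholder standing in for the full per-zone rebase logic.
def rebaseSlowPath(micros: Long): Long = micros

def rebaseMicros(micros: Long): Long =
  if (micros >= lastSwitchTs) micros  // modern timestamp: skip rebasing
  else rebaseSlowPath(micros)
{code}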
[jira] [Created] (SPARK-31623) Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write
Maxim Gekk created SPARK-31623: -- Summary: Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write Key: SPARK-31623 URL: https://issues.apache.org/jira/browse/SPARK-31623 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add benchmark cases to DateTimeRebaseBenchmark for: # Read/Write INT96 timestamps # Read/Write TIMESTAMP_MILLIS w/ rebasing on/off -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
[ https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-31554. Resolution: Not A Problem > Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite > > > Key: SPARK-31554 > URL: https://issues.apache.org/jira/browse/SPARK-31554 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, > for example: > * https://github.com/apache/spark/pull/28328#issuecomment-618992335 > The error message: > {code} > org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error > reporting > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with > error line 'Exception in thread "main" > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: > Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152) > at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188) > at > scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192) > at > org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30) > {code} > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/27617#issuecomment-614318644 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
Maxim Gekk created SPARK-31579: -- Summary: Replace floorDiv by / in localRebaseGregorianToJulianDays() Key: SPARK-31579 URL: https://issues.apache.org/jira/browse/SPARK-31579 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check that for all available time zones in the range of [0001, 2100] years with the step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv can be replaced by /, and this should improve performance of RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
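For context on why the hypothesis matters, a small plain-Scala illustration (no Spark internals): floorDiv and / disagree only for negative dividends with a non-zero remainder, so confirming the hypothesis would make the two interchangeable here.
{code:scala}
val MILLIS_PER_DAY = 86400000L

// floorDiv rounds toward negative infinity, / truncates toward zero;
// they disagree only for negative dividends with a non-zero remainder.
Math.floorDiv(-1L, MILLIS_PER_DAY)  // -1
-1L / MILLIS_PER_DAY                // 0

// If utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 always holds, the two
// agree even for negative values, so the cheaper / would be safe:
val millis = -3L * MILLIS_PER_DAY
assert(Math.floorDiv(millis, MILLIS_PER_DAY) == millis / MILLIS_PER_DAY)  // both -3
{code}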
[jira] [Commented] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092824#comment-17092824 ] Maxim Gekk commented on SPARK-31449: [~cloud_fan] [~hyukjin.kwon] I compared results of those 2 functions for all time zones with step of 1 day, and found many differences in results: {code:scala} test("Investigate the difference between JDK and Spark's time zone offset calculation") { import java.util.{Calendar, TimeZone} import sun.util.calendar.ZoneInfo def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = { var guess = tz.getRawOffset // the actual offset should be calculated based on milliseconds in UTC val offset = tz.getOffset(millisLocal - guess) if (offset != guess) { guess = tz.getOffset(millisLocal - offset) if (guess != offset) { // fallback to do the reverse lookup using java.sql.Timestamp // this should only happen near the start or end of DST val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt val year = getYear(days) val month = getMonth(days) val day = getDayOfMonth(days) var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt if (millisOfDay < 0) { millisOfDay += MILLIS_PER_DAY.toInt } val seconds = (millisOfDay / 1000L).toInt val hh = seconds / 3600 val mm = seconds / 60 % 60 val ss = seconds % 60 val ms = millisOfDay % 1000 val calendar = Calendar.getInstance(tz) calendar.set(year, month - 1, day, hh, mm, ss) calendar.set(Calendar.MILLISECOND, ms) guess = (millisLocal - calendar.getTimeInMillis()).toInt } } guess } def getOffsetFromLocalMillis2(millisLocal: Long, tz: TimeZone): Long = { tz match { case zoneInfo: ZoneInfo => zoneInfo.getOffsetsByWall(millisLocal, null) case timeZone: TimeZone => timeZone.getOffset(millisLocal - timeZone.getRawOffset) } } ALL_TIMEZONES .sortBy(_.getId) .foreach { zid => withDefaultTimeZone(zid) { val start = microsToMillis(instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(zid) .toInstant)) val end = microsToMillis(instantToMicros(LocalDateTime.of(2037, 1, 1, 0, 0, 0) .atZone(zid) .toInstant)) var millis = start var step: Long = MILLIS_PER_DAY while (millis < end) { val offset1 = getOffsetFromLocalMillis(millis, TimeZone.getTimeZone(zid)) val offset2 = getOffsetFromLocalMillis2(millis, TimeZone.getTimeZone(zid)) if (offset1 != offset2) { println(s"${zid.getId} ${new Timestamp(millis)} $offset1 $offset2") } millis += step } } } } {code} {code} Africa/Algiers 1916-10-01 23:47:48.0 360 0 Africa/Algiers 1917-10-07 23:47:48.0 360 0 Africa/Algiers 1918-10-06 23:47:48.0 360 0 Africa/Algiers 1919-10-05 23:47:48.0 360 0 Africa/Algiers 1920-10-23 23:47:48.0 360 0 Africa/Algiers 1921-06-21 23:47:48.0 360 0 Africa/Algiers 1946-10-06 23:47:48.0 360 0 Africa/Algiers 1963-04-13 23:47:48.0 360 0 Africa/Algiers 1971-09-26 23:47:48.0 360 0 Africa/Algiers 1979-10-25 23:47:48.0 360 0 Africa/Ceuta 1900-01-01 00:00:00.0 360 -1276000 Africa/Ceuta 1924-10-05 00:21:16.0 360 0 Africa/Ceuta 1926-10-03 00:21:16.0 360 0 Africa/Ceuta 1927-10-02 00:21:16.0 360 0 Africa/Ceuta 1928-10-07 00:21:16.0 360 0 Africa/Sao_Tome 1899-12-31 23:33:04.0 0 -2205000 Africa/Tripoli 1952-01-01 00:07:16.0 720 360 Africa/Tripoli 1954-01-01 00:07:16.0 720 360 Africa/Tripoli 1956-01-01 00:07:16.0 720 360 Africa/Tripoli 1982-01-01 00:07:16.0 720 360 Africa/Tripoli 1982-10-01 00:07:16.0 720 360 Africa/Tripoli 1983-10-01 00:07:16.0 720 360 Africa/Tripoli 1984-10-01 00:07:16.0 720 360 Africa/Tripoli 1985-10-01 00:07:16.0 720 360 Africa/Tripoli 1986-10-03 00:07:16.0 720 360 Africa/Tripoli 
1987-10-01 00:07:16.0 720 360 Africa/Tripoli 1988-10-01 00:07:16.0 720 360 Africa/Tripoli 1989-10-01 00:07:16.0 720 360 Africa/Tripoli 1996-09-30 00:07:16.0 720 360 America/Inuvik 1965-10-30 18:00:00.0 -2160 -2880 America/Iqaluit 1999-10-30 20:00:00.0 -1440 -2160 America/Pangnirtung 1999-10-30 20:00:00.0 -1440 -2160 Antarctica/Casey 1900-01-01 00:00:00.0 2880 0 Antarctica/Davis 1900-01-01 00:00:00.0 2520 0 Antarctica/Davis 2009-10-18 05:00:00.0 2520 1800 Antarctica/Davis 2011-10-28 05:00:00.0 2520 1800 Antarctica/DumontDUrville 1900-01-01 00:00:00.0 3600 0 Antarctica/Mawson 1900-01-01 00:00:00.0 1800 0 Antarctica/Syowa 1900-01-01 00:00:0
[jira] [Commented] (SPARK-31563) Failure of InSet.sql for UTF8String collection
[ https://issues.apache.org/jira/browse/SPARK-31563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092168#comment-17092168 ] Maxim Gekk commented on SPARK-31563: I am working on the issue > Failure of InSet.sql for UTF8String collection > -- > > Key: SPARK-31563 > URL: https://issues.apache.org/jira/browse/SPARK-31563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The InSet expression works on collections of internal Catalyst's types. We > can see this in the optimization when In is replaced by InSet, and In's > collection is evaluated to internal Catalyst's values: > [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L254] > {code:scala} > if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) { > val hSet = newList.map(e => e.eval(EmptyRow)) > InSet(v, HashSet() ++ hSet) > } > {code} > The code existed before the optimization > https://github.com/apache/spark/pull/25754 that made another wrong assumption > about collection types. > If InSet accepts only internal Catalyst's types, the following code shouldn't > fail: > {code:scala} > InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString)).sql > {code} > but it fails with the exception: > {code} > Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a > java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.unsafe.types.UTF8String a > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:88) > at > org.apache.spark.sql.catalyst.expressions.InSet.$anonfun$sql$2(predicates.scala:522) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31563) Failure of InSet.sql for UTF8String collection
Maxim Gekk created SPARK-31563: -- Summary: Failure of InSet.sql for UTF8String collection Key: SPARK-31563 URL: https://issues.apache.org/jira/browse/SPARK-31563 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 3.0.0, 3.1.0 Reporter: Maxim Gekk The InSet expression works on collections of internal Catalyst's types. We can see this in the optimization when In is replaced by InSet, and In's collection is evaluated to internal Catalyst's values: [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L254] {code:scala} if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) { val hSet = newList.map(e => e.eval(EmptyRow)) InSet(v, HashSet() ++ hSet) } {code} The code existed before the optimization https://github.com/apache/spark/pull/25754 that made another wrong assumption about collection types. If InSet accepts only internal Catalyst's types, the following code shouldn't fail: {code:scala} InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString)).sql {code} but it fails with the exception: {code} Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a java.lang.RuntimeException: Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:88) at org.apache.spark.sql.catalyst.expressions.InSet.$anonfun$sql$2(predicates.scala:522) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
[ https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091614#comment-17091614 ] Maxim Gekk commented on SPARK-31554: [~cloud_fan] [~hyukjin.kwon] Can we disable the flaky test till someone makes it stable? > Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite > > > Key: SPARK-31554 > URL: https://issues.apache.org/jira/browse/SPARK-31554 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, > for example: > * https://github.com/apache/spark/pull/28328#issuecomment-618992335 > The error message: > {code} > org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error > reporting > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with > error line 'Exception in thread "main" > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: > Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152) > at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188) > at > scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192) > at > org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30) > {code} > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/27617#issuecomment-614318644 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
Maxim Gekk created SPARK-31554: -- Summary: Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite Key: SPARK-31554 URL: https://issues.apache.org/jira/browse/SPARK-31554 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, for example: * https://github.com/apache/spark/pull/28328#issuecomment-618992335 The error message: {code} org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error reporting Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with error line 'Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135) at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152) at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188) at scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192) at org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30) {code} * https://github.com/apache/spark/pull/28261#issuecomment-618950225 * https://github.com/apache/spark/pull/28261#issuecomment-618950225 * https://github.com/apache/spark/pull/27617#issuecomment-614318644 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections
[ https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091490#comment-17091490 ] Maxim Gekk commented on SPARK-31553: I am working on the issue > Wrong result of isInCollection for large collections > > > Key: SPARK-31553 > URL: https://issues.apache.org/jira/browse/SPARK-31553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > If the size of a collection passed to isInCollection is bigger than > spark.sql.optimizer.inSetConversionThreshold, the method can return wrong > results for some inputs. For example: > {code:scala} > val set = (0 to 20).map(_.toString).toSet > val data = Seq("1").toDF("x") > println(set.contains("1")) > data.select($"x".isInCollection(set).as("isInCollection")).show() > {code} > {code} > true > +--+ > |isInCollection| > +--+ > | false| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31553) Wrong result of isInCollection for large collections
Maxim Gekk created SPARK-31553: -- Summary: Wrong result of isInCollection for large collections Key: SPARK-31553 URL: https://issues.apache.org/jira/browse/SPARK-31553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk If the size of a collection passed to isInCollection is bigger than spark.sql.optimizer.inSetConversionThreshold, the method can return wrong results for some inputs. For example: {code:scala} val set = (0 to 20).map(_.toString).toSet val data = Seq("1").toDF("x") println(set.contains("1")) data.select($"x".isInCollection(set).as("isInCollection")).show() {code} {code} true +--+ |isInCollection| +--+ | false| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091389#comment-17091389 ] Maxim Gekk commented on SPARK-31463: Parsing itself takes 10-20% of the time. The JSON datasource spends significant time in conversions to the desired types according to the schema. Even if you improve the performance of parsing by a few times, the total impact will not be that significant. > Enhance JsonDataSource by replacing jackson with simdjson > - > > Key: SPARK-31463 > URL: https://issues.apache.org/jira/browse/SPARK-31463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Steven Moy >Priority: Minor > > I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how > to improve json reading speed. We use Spark to process terabytes of JSON, so > we try to find ways to improve JSON parsing speed. > > [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/] > > [https://github.com/simdjson/simdjson/issues/93] > > Anyone in the open-source community interested in leading this effort to > integrate simdjson in the Spark JSON data source API? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31449: --- Summary: Investigate the difference between JDK and Spark's time zone offset calculation (was: Is there a difference between JDK and Spark's time zone offset calculation) > Investigate the difference between JDK and Spark's time zone offset > calculation > --- > > Key: SPARK-31449 > URL: https://issues.apache.org/jira/browse/SPARK-31449 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > Spark 2.4 calculates time zone offsets from wall clock timestamp using > `DateTimeUtils.getOffsetFromLocalMillis()` (see > https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118): > {code:scala} > private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): > Long = { > var guess = tz.getRawOffset > // the actual offset should be calculated based on milliseconds in UTC > val offset = tz.getOffset(millisLocal - guess) > if (offset != guess) { > guess = tz.getOffset(millisLocal - offset) > if (guess != offset) { > // fallback to do the reverse lookup using java.sql.Timestamp > // this should only happen near the start or end of DST > val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > val year = getYear(days) > val month = getMonth(days) > val day = getDayOfMonth(days) > var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt > if (millisOfDay < 0) { > millisOfDay += MILLIS_PER_DAY.toInt > } > val seconds = (millisOfDay / 1000L).toInt > val hh = seconds / 3600 > val mm = seconds / 60 % 60 > val ss = seconds % 60 > val ms = millisOfDay % 1000 > val calendar = Calendar.getInstance(tz) > calendar.set(year, month - 1, day, hh, mm, ss) > calendar.set(Calendar.MILLISECOND, ms) > guess = (millisLocal - calendar.getTimeInMillis()).toInt > } > } > guess > } > {code} > Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see > https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: > {code:java} > if (zone instanceof ZoneInfo) { > ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); > } else { > int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? > internalGet(ZONE_OFFSET) : > zone.getRawOffset(); > zone.getOffsets(millis - gmtOffset, zoneOffsets); > } > {code} > We need to investigate whether there are any differences in results between the 2 approaches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31449: --- Issue Type: Improvement (was: Question) > Investigate the difference between JDK and Spark's time zone offset > calculation > --- > > Key: SPARK-31449 > URL: https://issues.apache.org/jira/browse/SPARK-31449 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > Spark 2.4 calculates time zone offsets from wall clock timestamp using > `DateTimeUtils.getOffsetFromLocalMillis()` (see > https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118): > {code:scala} > private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): > Long = { > var guess = tz.getRawOffset > // the actual offset should be calculated based on milliseconds in UTC > val offset = tz.getOffset(millisLocal - guess) > if (offset != guess) { > guess = tz.getOffset(millisLocal - offset) > if (guess != offset) { > // fallback to do the reverse lookup using java.sql.Timestamp > // this should only happen near the start or end of DST > val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > val year = getYear(days) > val month = getMonth(days) > val day = getDayOfMonth(days) > var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt > if (millisOfDay < 0) { > millisOfDay += MILLIS_PER_DAY.toInt > } > val seconds = (millisOfDay / 1000L).toInt > val hh = seconds / 3600 > val mm = seconds / 60 % 60 > val ss = seconds % 60 > val ms = millisOfDay % 1000 > val calendar = Calendar.getInstance(tz) > calendar.set(year, month - 1, day, hh, mm, ss) > calendar.set(Calendar.MILLISECOND, ms) > guess = (millisLocal - calendar.getTimeInMillis()).toInt > } > } > guess > } > {code} > Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see > https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: > {code:java} > if (zone instanceof ZoneInfo) { > ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); > } else { > int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? > internalGet(ZONE_OFFSET) : > zone.getRawOffset(); > zone.getOffsets(millis - gmtOffset, zoneOffsets); > } > {code} > We need to investigate whether there are any differences in results between the 2 approaches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31490) Benchmark conversions to/from Java 8 date-time types
Maxim Gekk created SPARK-31490: -- Summary: Benchmark conversions to/from Java 8 date-time types Key: SPARK-31490 URL: https://issues.apache.org/jira/browse/SPARK-31490 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk DATE and TIMESTAMP column values can be converted to java.sql.Date and java.sql.Timestamp (by default), or to the Java 8 date-time types java.time.LocalDate and java.time.Instant when spark.sql.datetime.java8API.enabled is set to true. DateTimeBenchmark misses benchmarks of Java 8 dates/timestamps. This ticket aims to fix that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
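For reference, the behavior the new benchmark cases would exercise — a small spark-shell example of the config toggle (the literal values are illustrative): with the Java 8 API enabled, collected DATE/TIMESTAMP values come back as java.time.LocalDate/java.time.Instant instead of the java.sql types.
{code:scala}
spark.conf.set("spark.sql.datetime.java8API.enabled", true)
val row = spark.sql("SELECT DATE '2020-04-17' AS d, TIMESTAMP '2020-04-17 00:00:00' AS ts").head
val d: java.time.LocalDate = row.getAs[java.time.LocalDate]("d")
val ts: java.time.Instant = row.getAs[java.time.Instant]("ts")
{code}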
[jira] [Created] (SPARK-31489) Failure on pushing down filters with java.time.LocalDate values in ORC
Maxim Gekk created SPARK-31489: -- Summary: Failure on pushing down filters with java.time.LocalDate values in ORC Key: SPARK-31489 URL: https://issues.apache.org/jira/browse/SPARK-31489 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.0.1 Reporter: Maxim Gekk When spark.sql.datetime.java8API.enabled is set to true, filters pushed down to the ORC datasource with java.time.LocalDate values fail with the exception: {code} Wrong value class java.time.LocalDate for DATE.EQUALS leaf java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352) at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31488: --- Description: Currently, ParquetFilters supports only java.sql.Date values of DateType, and explicitly casts Any to java.sql.Date, see https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 So, any filters that refer to date values are not pushed down to Parquet when spark.sql.datetime.java8API.enabled is true. was: Currently, ParquetFilters supports only java.sql.Date values of DateType, and explicitly casts Any to java.sql.Date, see https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 The code fails with an exception when spark.sql.datetime.java8API.enabled is true. > Support `java.time.LocalDate` in Parquet filter pushdown > > > Key: SPARK-31488 > URL: https://issues.apache.org/jira/browse/SPARK-31488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, ParquetFilters supports only java.sql.Date values of DateType, and > explicitly casts Any to java.sql.Date, see > https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 > So, any filters that refer to date values are not pushed down to Parquet when > spark.sql.datetime.java8API.enabled is true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown
Maxim Gekk created SPARK-31488: -- Summary: Support `java.time.LocalDate` in Parquet filter pushdown Key: SPARK-31488 URL: https://issues.apache.org/jira/browse/SPARK-31488 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Currently, ParquetFilters supports only java.sql.Date values of DateType, and explicitly casts Any to java.sql.Date, see https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 The code fails with an exception when spark.sql.datetime.java8API.enabled is true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
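A sketch of the general shape of a fix (the helper below is hypothetical and deliberately simplified — it ignores the session time zone and calendar rebasing that Spark's real conversions perform): accept both value classes that a DateType filter value can arrive as.
{code:scala}
import java.time.LocalDate

// Convert a pushed-down date filter value to days since the epoch,
// whichever of the two possible classes it arrives as.
def dateToDays(value: Any): Int = value match {
  case d: java.sql.Date => Math.floorDiv(d.getTime, 86400000L).toInt // simplified: no time zone handling
  case ld: LocalDate    => ld.toEpochDay.toInt
}
{code}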
[jira] [Created] (SPARK-31471) Add a script to run multiple benchmarks
Maxim Gekk created SPARK-31471: -- Summary: Add a script to run multiple benchmarks Key: SPARK-31471 URL: https://issues.apache.org/jira/browse/SPARK-31471 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add a python script to run multiple benchmarks. The script can be taken from [https://github.com/apache/spark/pull/27078] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084308#comment-17084308 ] Maxim Gekk commented on SPARK-31423: [~bersprockets] I think we should take the next valid date for any non-existent dates, see the linked PR. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31449) Is there a difference between JDK and Spark's time zone offset calculation
Maxim Gekk created SPARK-31449: -- Summary: Is there a difference between JDK and Spark's time zone offset calculation Key: SPARK-31449 URL: https://issues.apache.org/jira/browse/SPARK-31449 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.4.5 Reporter: Maxim Gekk Spark 2.4 calculates time zone offsets from wall clock timestamp using `DateTimeUtils.getOffsetFromLocalMillis()` (see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118): {code:scala} private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = { var guess = tz.getRawOffset // the actual offset should be calculated based on milliseconds in UTC val offset = tz.getOffset(millisLocal - guess) if (offset != guess) { guess = tz.getOffset(millisLocal - offset) if (guess != offset) { // fallback to do the reverse lookup using java.sql.Timestamp // this should only happen near the start or end of DST val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt val year = getYear(days) val month = getMonth(days) val day = getDayOfMonth(days) var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt if (millisOfDay < 0) { millisOfDay += MILLIS_PER_DAY.toInt } val seconds = (millisOfDay / 1000L).toInt val hh = seconds / 3600 val mm = seconds / 60 % 60 val ss = seconds % 60 val ms = millisOfDay % 1000 val calendar = Calendar.getInstance(tz) calendar.set(year, month - 1, day, hh, mm, ss) calendar.set(Calendar.MILLISECOND, ms) guess = (millisLocal - calendar.getTimeInMillis()).toInt } } guess } {code} Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: {code:java} if (zone instanceof ZoneInfo) { ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); } else { int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? internalGet(ZONE_OFFSET) : zone.getRawOffset(); zone.getOffsets(millis - gmtOffset, zoneOffsets); } {code} We need to investigate whether there are any differences in results between the 2 approaches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083595#comment-17083595 ] Maxim Gekk commented on SPARK-31423: I have debugged this a bit on Spark 2.4: '1582-10-14' falls into this case while parsing from UTF8String: https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2762-L2768 {code:java} // The date is in a "missing" period. if (!isLenient()) { throw new IllegalArgumentException("the specified date doesn't exist"); } // Take the Julian date for compatibility, which // will produce a Gregorian date. fixedDate = jfd; {code} In the strict (non-lenient) mode, the code https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L517 would throw the exception: {code} throw new IllegalArgumentException("the specified date doesn't exist") {code} but we are in the lenient mode, in which Java 7's GregorianCalendar interprets the date specially: {code} // Take the Julian date for compatibility, which // will produce a Gregorian date. {code} The date '1582-10-14' doesn't exist in the hybrid calendar used by the Java 7 time API. It is questionable how to handle the date in such a calendar. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change.
> {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should
[jira] [Resolved] (SPARK-31445) Avoid floating-point division in millisToDays
[ https://issues.apache.org/jira/browse/SPARK-31445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-31445. Resolution: Won't Fix > Avoid floating-point division in millisToDays > - > > Key: SPARK-31445 > URL: https://issues.apache.org/jira/browse/SPARK-31445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Minor > > The benchmark https://github.com/MaxGekk/spark/pull/27, and a comparison to > Spark 3.0 plus an optimisation of fromJavaDate in > https://github.com/apache/spark/pull/28205, show that floating-point ops in > millisToDays badly impact the performance of converting java.sql.Date to > Catalyst's date values. The ticket aims to replace the double ops by int/long ops. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31445) Avoid floating-point division in millisToDays
Maxim Gekk created SPARK-31445: -- Summary: Avoid floating-point division in millisToDays Key: SPARK-31445 URL: https://issues.apache.org/jira/browse/SPARK-31445 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Reporter: Maxim Gekk The benchmark https://github.com/MaxGekk/spark/pull/27, and a comparison to Spark 3.0 plus an optimisation of fromJavaDate in https://github.com/apache/spark/pull/28205, show that floating-point ops in millisToDays badly impact the performance of converting java.sql.Date to Catalyst's date values. The ticket aims to replace the double ops by int/long ops. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
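The proposed replacement, in spirit — a standalone sketch that leaves out the time zone offset handling of the real millisToDays:
{code:scala}
val MILLIS_PER_DAY = 86400000L

// Current shape: floating-point floor over a double division.
def millisToDaysViaDouble(millis: Long): Int =
  Math.floor(millis.toDouble / MILLIS_PER_DAY).toInt

// Proposed shape: integer floor division, no double arithmetic.
def millisToDaysViaLong(millis: Long): Int =
  Math.floorDiv(millis, MILLIS_PER_DAY).toInt

assert(millisToDaysViaDouble(-1L) == millisToDaysViaLong(-1L))  // both -1
{code}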
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083314#comment-17083314 ] Maxim Gekk commented on SPARK-31423: I am working on the issue. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
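As background for the issue quoted above: the gap 1582-10-05..1582-10-14 exists only in the hybrid Julian+Gregorian calendar. A hedged illustration on a plain JVM; the java.sql.Date output is expected, by the lenient hybrid-calendar normalization the issue describes, to show the same 10-day shift:
{code:scala}
import java.sql.Date
import java.time.LocalDate

// Hybrid calendar (java.sql.Date / java.util.GregorianCalendar): the dates
// 1582-10-05..14 don't exist, and lenient normalization rolls them forward.
println(Date.valueOf("1582-10-14"))    // expected: 1582-10-24
// Proleptic Gregorian calendar (java.time): the date is valid as-is.
println(LocalDate.parse("1582-10-14")) // 1582-10-14
{code}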
[jira] [Comment Edited] (SPARK-31443) Perf regression of toJavaDate
[ https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083217#comment-17083217 ] Maxim Gekk edited comment on SPARK-31443 at 4/14/20, 1:21 PM: -- FYI [~cloud_fan] I got the numbers on the master without https://github.com/apache/spark/pull/28205 was (Author: maxgekk): FYI [~cloud_fan] > Perf regression of toJavaDate > - > > Key: SPARK-31443 > URL: https://issues.apache.org/jira/browse/SPARK-31443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > DateTimeBenchmark shows the regression > Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 559603 > 38 8.9 111.8 1.0X > Collect dates 2306 3221 > 1558 2.2 461.1 0.2X > {code} > Current master: > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 1052 1130 > 73 4.8 210.3 1.0X > Collect dates 3251 4943 > 1624 1.5 650.2 0.3X > {code} > If we subtract preparing DATE column: > * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row > * master is (650.2 - 210.3) = 439 ns/row > The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - > 349.3)/349.3 = 25% -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31443) Perf regression of toJavaDate
[ https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083217#comment-17083217 ] Maxim Gekk commented on SPARK-31443: FYI [~cloud_fan] > Perf regression of toJavaDate > - > > Key: SPARK-31443 > URL: https://issues.apache.org/jira/browse/SPARK-31443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > DateTimeBenchmark shows the regression > Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 559603 > 38 8.9 111.8 1.0X > Collect dates 2306 3221 > 1558 2.2 461.1 0.2X > {code} > Current master: > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 1052 1130 > 73 4.8 210.3 1.0X > Collect dates 3251 4943 > 1624 1.5 650.2 0.3X > {code} > If we subtract preparing DATE column: > * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row > * master is (650.2 - 210.3) = 439 ns/row > The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - > 349.3)/349.3 = 25% -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31443) Perf regression of toJavaDate
[ https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31443: --- Description: DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] {code:java} OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 559603 > 38 8.9 111.8 1.0X Collect dates 2306 3221 1558 2.2 461.1 0.2X {code} Current master: {code:java} OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 1052 1130 > 73 4.8 210.3 1.0X Collect dates 3251 4943 1624 1.5 650.2 0.3X {code} If we subtract preparing DATE column: * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row * master is (650.2 - 210.3) = 439 ns/row The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 349.3)/349.3 = 25% was: DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27 {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 614655 > 43 8.1 122.8 1.0X {code} Current master: {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 1154 1206 > 46 4.3 230.9 1.0X {code} The regression is ~x2. > Perf regression of toJavaDate > - > > Key: SPARK-31443 > URL: https://issues.apache.org/jira/browse/SPARK-31443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > DateTimeBenchmark shows the regression > Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 559603 > 38 8.9 111.8 1.0X > Collect dates 2306 3221 > 1558 2.2 461.1 0.2X > {code} > Current master: > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > --
[jira] [Created] (SPARK-31443) Perf regression of toJavaDate
Maxim Gekk created SPARK-31443: -- Summary: Perf regression of toJavaDate Key: SPARK-31443 URL: https://issues.apache.org/jira/browse/SPARK-31443 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27 {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 614 655 43 8.1 122.8 1.0X {code} Current master: {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 1154 1206 46 4.3 230.9 1.0X {code} The regression is ~x2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31439) Perf regression of fromJavaDate
Maxim Gekk created SPARK-31439: -- Summary: Perf regression of fromJavaDate Key: SPARK-31439 URL: https://issues.apache.org/jira/browse/SPARK-31439 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27 {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 614 655 43 8.1 122.8 1.0X {code} Current master: {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 1154 1206 46 4.3 230.9 1.0X {code} The regression is ~x2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
[ https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31426: --- Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Regression in loading/saving timestamps from/to ORC files > - > > Key: SPARK-31426 > URL: https://issues.apache.org/jira/browse/SPARK-31426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Here are results of DateTimeRebaseBenchmark on the current master branch: > {code} > Save timestamps to ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 158259877 59877 >0 1.7 598.8 0.0X > before 1582 61361 61361 >0 1.6 613.6 0.0X > Load timestamps from ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 1582, vec off 48197 48288 > 118 2.1 482.0 1.0X > after 1582, vec on38247 38351 > 128 2.6 382.5 1.3X > before 1582, vec off 53179 53359 > 249 1.9 531.8 0.9X > before 1582, vec on 44076 44268 > 269 2.3 440.8 1.1X > {code} > The results of the same benchmark on Spark 2.4.6-SNAPSHOT: > {code} > Save timestamps to ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 158218858 18858 >0 5.3 188.6 1.0X > before 1582 18508 18508 >0 5.4 185.1 1.0X > Load timestamps from ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 1582, vec off 14063 14177 > 143 7.1 140.6 1.0X > after 1582, vec on 5955 6029 > 100 16.8 59.5 2.4X > before 1582, vec off 14119 14126 >7 7.1 141.2 1.0X > before 1582, vec on5991 6007 > 25 16.7 59.9 2.3X > {code} > Here is the PR with DateTimeRebaseBenchmark backported to 2.4: > https://github.com/MaxGekk/spark/pull/27 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31423: --- Comment: was deleted (was: This is intentional behavior because ORC format assumes the hybrid calendar (Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian calendar. See https://issues.apache.org/jira/browse/SPARK-30951) > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082051#comment-17082051 ] Maxim Gekk commented on SPARK-31423: This is intentional behavior because ORC format assumes the hybrid calendar (Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian calendar. See https://issues.apache.org/jira/browse/SPARK-30951 > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). 
In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
Maxim Gekk created SPARK-31426: -- Summary: Regression in loading/saving timestamps from/to ORC files Key: SPARK-31426 URL: https://issues.apache.org/jira/browse/SPARK-31426 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Here are results of DateTimeRebaseBenchmark on the current master branch: {code} Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582 59877 59877 0 1.7 598.8 0.0X before 1582 61361 61361 0 1.6 613.6 0.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582, vec off 48197 48288 118 2.1 482.0 1.0X after 1582, vec on 38247 38351 128 2.6 382.5 1.3X before 1582, vec off 53179 53359 249 1.9 531.8 0.9X before 1582, vec on 44076 44268 269 2.3 440.8 1.1X {code} The results of the same benchmark on Spark 2.4.6-SNAPSHOT: {code} Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582 18858 18858 0 5.3 188.6 1.0X before 1582 18508 18508 0 5.4 185.1 1.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582, vec off 14063 14177 143 7.1 140.6 1.0X after 1582, vec on 5955 6029 100 16.8 59.5 2.4X before 1582, vec off 14119 14126 7 7.1 141.2 1.0X before 1582, vec on 5991 6007 25 16.7 59.9 2.3X {code} Here is the PR with DateTimeRebaseBenchmark backported to 2.4: https://github.com/MaxGekk/spark/pull/27 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28624) make_date is inconsistent when reading from table
[ https://issues.apache.org/jira/browse/SPARK-28624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080312#comment-17080312 ] Maxim Gekk commented on SPARK-28624: toJavaDate is implemented differently in the master [https://github.com/apache/spark/blob/e2d9399602d485eae94cd530d134ebab336e9e9b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L129-L132] > make_date is inconsistent when reading from table > - > > Key: SPARK-28624 > URL: https://issues.apache.org/jira/browse/SPARK-28624 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: Screen Shot 2019-08-05 at 18.19.39.png, collect > make_date.png > > > {code:sql} > spark-sql> create table test_make_date as select make_date(-44, 3, 15) as d; > spark-sql> select d, make_date(-44, 3, 15) from test_make_date; > 0045-03-15-0044-03-15 > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31402) Incorrect rebasing of BCE dates
Maxim Gekk created SPARK-31402: -- Summary: Incorrect rebasing of BCE dates Key: SPARK-31402 URL: https://issues.apache.org/jira/browse/SPARK-31402 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Dates before the common era are rebased incorrectly, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/ {code} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: postgreSQL/date.sql Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for query #93 select make_date(-44, 3, 15) {code} Even though such dates are out of the valid range of dates supported by the DATE type, there is a test in postgreSQL/date.sql for a negative year, so it would be nice to fix the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
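A compact repro of the failing query from the report above; the expected/actual values are taken from the Jenkins failure, not re-derived:
{code:scala}
// BCE rebasing bug: year -44 comes back as 45 CE
// (spark is an existing spark-shell SparkSession).
spark.sql("select make_date(-44, 3, 15)").show()
// Expected: -0044-03-15
// Actual (per the test failure): 0045-03-15
{code}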
[jira] [Created] (SPARK-31398) Speed up reading dates in ORC
Maxim Gekk created SPARK-31398: -- Summary: Speed up reading dates in ORC Key: SPARK-31398 URL: https://issues.apache.org/jira/browse/SPARK-31398 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the ORC datasource converts values of the DATE type to java.sql.Date, and then converts the result to days since the epoch in the Proleptic Gregorian calendar. The ORC datasource does this conversion when spark.sql.orc.enableVectorizedReader is set to false. The conversion to java.sql.Date is not necessary because we can use DaysWritable, which performs the rebasing in a much more optimal way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
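A hedged sketch of the proposed shortcut, assuming Hive's DateWritable exposes the hybrid-calendar day count via getDays() and that Spark's day-rebasing helper is available under the name used below (illustrative, not the actual patch):
{code:scala}
import org.apache.hadoop.hive.serde2.io.DateWritable
import org.apache.spark.sql.catalyst.util.RebaseDateTime.rebaseJulianToGregorianDays

// Skip the DateWritable -> java.sql.Date -> days round trip: take the
// Julian day count from the writable and rebase days-to-days directly.
def readOrcDateAsDays(writable: DateWritable): Int =
  rebaseJulianToGregorianDays(writable.getDays)
{code}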
[jira] [Created] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
Maxim Gekk created SPARK-31385: -- Summary: Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing Key: SPARK-31385 URL: https://issues.apache.org/jira/browse/SPARK-31385 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar is not symmetric to opposite conversion for the following time zones: # Asia/Tehran # Iran # Africa/Casablanca # Africa/El_Aaiun Here is the results from the https://github.com/apache/spark/pull/28119: Julian -> Gregorian: {code:json} , { "tz" : "Asia/Tehran", "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, -2208988800, 2547315000, 2547401400 ], "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] }, { "tz" : "Iran", "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, -2208988800, 2547315000, 2547401400 ], "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] }, { "tz" : "Africa/Casablanca", "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, 3699828000, 3702855600 ], "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600 ] }, { "tz" : "Africa/El_Aaiun", "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 3332714400, 333
[jira] [Created] (SPARK-31359) Speed up timestamps rebasing
Maxim Gekk created SPARK-31359: -- Summary: Speed up timestamps rebasing Key: SPARK-31359 URL: https://issues.apache.org/jira/browse/SPARK-31359 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, rebasing of timestamps is performed via conversions to local timestamps and back to microseconds. This is a CPU-intensive operation which can be avoided by converting via pre-calculated tables per time zone. For example, below are the timestamps at which the diff changes in the America/Los_Angeles time zone for the range 0001-01-01...2100-01-01: {code} 0001-01-01T00:00 diff = -2872 minutes 0100-03-01T00:00 diff = -1432 minutes 0200-03-01T00:00 diff = 7 minutes 0300-03-01T00:00 diff = 1447 minutes 0500-03-01T00:00 diff = 2887 minutes 0600-03-01T00:00 diff = 4327 minutes 0700-03-01T00:00 diff = 5767 minutes 0900-03-01T00:00 diff = 7207 minutes 1000-03-01T00:00 diff = 8647 minutes 1100-03-01T00:00 diff = 10087 minutes 1300-03-01T00:00 diff = 11527 minutes 1400-03-01T00:00 diff = 12967 minutes 1500-03-01T00:00 diff = 14407 minutes 1582-10-15T00:00 diff = 7 minutes 1883-11-18T12:22:58 diff = 0 minutes {code} It seems possible to build such rebasing maps and perform the rebasing via them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
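A hedged sketch of the lookup this ticket proposes, assuming pre-computed per-zone arrays of switch instants and diffs, both already converted to microseconds (the table above is in minutes), and assuming the input is not earlier than the first switch point:
{code:scala}
import java.util.Arrays

// switchMicros(i) is the instant from which diffMicros(i) applies;
// switchMicros is sorted ascending.
def rebaseViaTable(micros: Long, switchMicros: Array[Long], diffMicros: Array[Long]): Long = {
  var i = Arrays.binarySearch(switchMicros, micros)
  // On a miss, binarySearch returns -(insertionPoint) - 1; the last
  // switch point <= micros sits at insertionPoint - 1.
  if (i < 0) i = -i - 2
  micros + diffMicros(i)
}
{code}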
[jira] [Created] (SPARK-31353) Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark
Maxim Gekk created SPARK-31353: -- Summary: Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark Key: SPARK-31353 URL: https://issues.apache.org/jira/browse/SPARK-31353 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Performance of date-time functions can depend on the JVM system time zone or the SQL config spark.sql.session.timeZone. To avoid any fluctuations in benchmark results, the ticket aims to set a time zone explicitly in the date-time benchmarks DateTimeBenchmark and DateTimeRebaseBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
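What "setting a time zone explicitly" could look like; spark.sql.session.timeZone is the real config key, while placing these two lines in the benchmarks' setup is an assumption about the change:
{code:scala}
import java.util.TimeZone

// Pin both the JVM default zone and Spark's session time zone (spark is
// an existing SparkSession) so results don't depend on the host's locale.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}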
[jira] [Created] (SPARK-31343) Check codegen does not fail on expressions with special characters in string parameters
Maxim Gekk created SPARK-31343: -- Summary: Check codegen does not fail on expressions with special characters in string parameters Key: SPARK-31343 URL: https://issues.apache.org/jira/browse/SPARK-31343 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add tests similar to those added by the PR https://github.com/apache/spark/pull/20182 for from_utc_timestamp / to_utc_timestamp. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
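A sketch of the kind of test meant here, modeled on the linked PR; the exact expression and codegen helper names are assumptions about Spark's internal test utilities:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.expressions.{FromUTCTimestamp, Literal}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection

// Codegen must not emit uncompilable Java when a string parameter contains
// characters that need escaping, such as a double quote.
GenerateUnsafeProjection.generate(
  FromUTCTimestamp(
    Literal(Timestamp.valueOf("2015-07-24 00:00:00")), Literal("\"quote")) :: Nil)
{code}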
[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31328: --- Description: Run the following code in the *America/Los_Angeles* time zone: {code:scala} test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } {code} The rebased and original micros must be the same after 1883-11-18 because the standard zone offset and DST offset are the same in Proleptic Gregorian calendar and in the hybrid calendar (Julian+Gregorian) but actually there are differences of 60 minutes: {code:java} local date-time = 0001-01-01T00:00 diff = -2872 minutes local date-time = 0100-03-01T00:00 diff = -1432 minutes local date-time = 0200-03-01T00:00 diff = 7 minutes local date-time = 0300-03-01T00:00 diff = 1447 minutes local date-time = 0500-03-01T00:00 diff = 2887 minutes local date-time = 0600-03-01T00:00 diff = 4327 minutes local date-time = 0700-03-01T00:00 diff = 5767 minutes local date-time = 0900-03-01T00:00 diff = 7207 minutes local date-time = 1000-03-01T00:00 diff = 8647 minutes local date-time = 1100-03-01T00:00 diff = 10087 minutes local date-time = 1300-03-01T00:00 diff = 11527 minutes local date-time = 1400-03-01T00:00 diff = 12967 minutes local date-time = 1500-03-01T00:00 diff = 14407 minutes local date-time = 1582-10-15T00:00 diff = 7 minutes local date-time = 1883-11-18T12:22:58 diff = 0 minutes local date-time = 1918-10-27T01:22:58 diff = 60 minutes local date-time = 1918-10-27T01:22:58 diff = 0 minutes local date-time = 1919-10-26T01:22:58 diff = 60 minutes local date-time = 1919-10-26T01:22:58 diff = 0 minutes local date-time = 1945-09-30T01:22:58 diff = 60 minutes local date-time = 1945-09-30T01:22:58 diff = 0 minutes local date-time = 1949-01-01T01:22:58 diff = 60 minutes local date-time = 1949-01-01T01:22:58 diff = 0 minutes local date-time = 1950-09-24T01:22:58 diff = 60 minutes local date-time = 1950-09-24T01:22:58 diff = 0 minutes local date-time = 1951-09-30T01:22:58 diff = 60 minutes local date-time = 1951-09-30T01:22:58 diff = 0 minutes local date-time = 1952-09-28T01:22:58 diff = 60 minutes local date-time = 1952-09-28T01:22:58 diff = 0 minutes local date-time = 1953-09-27T01:22:58 diff = 60 minutes local date-time = 1953-09-27T01:22:58 diff = 0 minutes local date-time = 1954-09-26T01:22:58 diff = 60 minutes local date-time = 1954-09-26T01:22:58 diff = 0 minutes local date-time = 1955-09-25T01:22:58 diff = 60 minutes local date-time = 1955-09-25T01:22:58 diff = 0 minutes local date-time = 1956-09-30T01:22:58 diff = 60 minutes local date-time = 1956-09-30T01:22:58 diff = 0 minutes local date-time = 1957-09-29T01:22:58 diff = 60 minutes local date-time = 1957-09-29T01:22:58 diff = 0 minutes local date-time = 1958-09-28T01:22:58 diff = 60 minutes local date-time = 1958-09-28T01:22:58 diff 
= 0 minutes local date-time = 1959-09-27T01:22:58 diff = 60 minutes local date-time = 1959-09-27T01:22:58 diff = 0 minutes local date-time = 1960-09-25T01:22:58 diff = 60 minutes local date-time = 1960-09-25T01:22:58 diff = 0 minutes local date-time = 1961-09-24T01:22:58 diff = 60 minutes local date-time = 1961-09-24T01:22:58 diff = 0 minutes local date-time = 1962-10-28T01:22:58 diff = 60 minutes local date-time = 1962-10-28T01:22:58 diff = 0 minutes local date-time = 1963-10-27T01:22:58 diff = 60 minutes local date-time = 1963-10-27T01:22:58 diff = 0 minutes local date-time = 1964-10-25T01:22:58 diff = 60 minutes local date-time = 1964-10-25T01:22:58 diff = 0 minutes local date-time = 1965-10-31T01:22:58 diff = 60 minutes local date-time = 1965-10-31T01:22:58 diff = 0 minutes local date-time = 1966-10-30T01:22:58 diff = 60 minutes local date-time = 1966-10-30T01:22:58 diff = 0 minutes local date-time = 1967-10-29T01:22:58 diff = 60 minutes local date-time = 1967-10-29T01:22:58 diff = 0 minutes local date-time = 1968-10-27T01:22:58 diff = 60 minutes local date-time = 1968-10-27T01:22:58 diff = 0 minutes local date-time = 1969-10-26T01:22:58 diff = 60 minutes local date-time = 1969-10-26T01:22:58 diff = 0 minutes local date-time = 1970-10-25T01:22:58 di
[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31328: --- Description: Run the following code in the *America/Los_Angeles* time zone: {code:scala} test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time = 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time 
= 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local date-time = 1997-10-25T15:52:58 diff = 60 minutes local date-time = 1997-10-25T16:52:58 diff = 0 minutes local date-time = 1998-10-24T15:52:58 diff = 60 minutes local date-time = 1998-10-24T16:52:58 diff = 0 minutes local date-time = 1999-10-30T15:52:58 diff = 60 minutes local date-time = 1999-10-30T16:52:58 diff = 0 minutes local date-time = 2000-10-28T15:52:58 diff = 60 minutes local date-time
[jira] [Created] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
Maxim Gekk created SPARK-31328: -- Summary: Incorrect timestamps rebasing on autumn daylight saving time Key: SPARK-31328 URL: https://issues.apache.org/jira/browse/SPARK-31328 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 I do believe it is possible to speed up date-time rebasing by building a map of micros to diffs between original and rebased micros. And look up at the map via binary search. For example, the *America/Los_Angeles* time zone has less than 100 points when diff changes: {code:scala} test("optimize rebasing") { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += MICROS_PER_HOUR } println(s"counter = $counter") } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time 
= 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time = 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local dat
[jira] [Updated] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
[ https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31318: --- Parent: SPARK-30951 Issue Type: Sub-task (was: Improvement) > Split Parquet/Avro configs for rebasing dates/timestamps in read and in write > - > > Key: SPARK-31318 > URL: https://issues.apache.org/jira/browse/SPARK-31318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, Spark provides 2 SQL configs to control rebasing of > dates/timestamps in Parquet and Avro datasource: > spark.sql.legacy.parquet.rebaseDateTime.enabled > spark.sql.legacy.avro.rebaseDateTime.enabled > The configs control rebasing in read and in write. That's can be inconvenient > for users who want to read files saved by Spark 2.4 and earlier versions, and > save dates/timestamps without rebasing. > The ticket aims to split the configs, and introduce separate SQL configs for > read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
Maxim Gekk created SPARK-31318: -- Summary: Split Parquet/Avro configs for rebasing dates/timestamps in read and in write Key: SPARK-31318 URL: https://issues.apache.org/jira/browse/SPARK-31318 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark provides 2 SQL configs to control rebasing of dates/timestamps in the Parquet and Avro datasources: spark.sql.legacy.parquet.rebaseDateTime.enabled spark.sql.legacy.avro.rebaseDateTime.enabled The configs control rebasing both in read and in write. That can be inconvenient for users who want to read files saved by Spark 2.4 and earlier versions but save dates/timestamps without rebasing. The ticket aims to split the configs and introduce separate SQL configs for read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
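Illustrative usage after such a split; both config names below are assumptions about the outcome of the split (following the existing legacy-rebase naming pattern), not confirmed names:
{code:scala}
// Rebase on write, e.g. to stay readable by Spark 2.4...
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
// ...and, independently, rebase on read for files written by Spark 2.4
// and earlier.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", true)
{code}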
[jira] [Updated] (SPARK-31311) Benchmark date-time rebasing in ORC datasource
[ https://issues.apache.org/jira/browse/SPARK-31311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31311: --- Description: * Benchmark saving dates/timestamps before and after 1582-10-15 * Benchmark loading dates/timestamps was: * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on > Benchmark date-time rebasing in ORC datasource > -- > > Key: SPARK-31311 > URL: https://issues.apache.org/jira/browse/SPARK-31311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > * Benchmark saving dates/timestamps before and after 1582-10-15 > * Benchmark loading dates/timestamps -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31311) Benchmark date-time rebasing in ORC datasource
Maxim Gekk created SPARK-31311: -- Summary: Benchmark date-time rebasing in ORC datasource Key: SPARK-31311 URL: https://issues.apache.org/jira/browse/SPARK-31311 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070457#comment-17070457 ] Maxim Gekk commented on SPARK-31297: The rebasing of days doesn't depend on time zone, and has just 14 special dates: {code:scala} test("optimize rebasing") { val start = localDateToDays(LocalDate.of(1, 1, 1)) val end = localDateToDays(LocalDate.of(2030, 1, 1)) var days = start var diff = Long.MaxValue var counter = 0 while (days < end) { val rebased = rebaseGregorianToJulianDays(days) val curDiff = rebased - days if (curDiff != diff) { counter += 1 diff = curDiff val ld = daysToLocalDate(days) println(s"local date = $ld days = $days diff = ${diff} days") } days += 1 } println(s"counter = $counter") } {code} {code} local date = 0001-01-01 days = -719162 diff = -2 days local date = 0100-03-01 days = -682944 diff = -1 days local date = 0200-03-01 days = -646420 diff = 0 days local date = 0300-03-01 days = -609896 diff = 1 days local date = 0500-03-01 days = -536847 diff = 2 days local date = 0600-03-01 days = -500323 diff = 3 days local date = 0700-03-01 days = -463799 diff = 4 days local date = 0900-03-01 days = -390750 diff = 5 days local date = 1000-03-01 days = -354226 diff = 6 days local date = 1100-03-01 days = -317702 diff = 7 days local date = 1300-03-01 days = -244653 diff = 8 days local date = 1400-03-01 days = -208129 diff = 9 days local date = 1500-03-01 days = -171605 diff = 10 days local date = 1582-10-15 days = -141427 diff = 0 days counter = 14 {code} > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. 
> For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-t
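The 14 switch points printed in the comment above translate directly into a lookup table; a hedged sketch (arrays copied from that output, function name illustrative):
{code:scala}
// Days since 1970-01-01 at which the Gregorian->Julian day diff changes,
// with the matching diffs in days, both taken from the printed output.
val switchDays = Array(-719162, -682944, -646420, -609896, -536847, -500323,
  -463799, -390750, -354226, -317702, -244653, -208129, -171605, -141427)
val dayDiffs = Array(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0)

def rebaseDaysViaTable(days: Int): Int = {
  // A linear scan is fine for 14 points; days before the first switch
  // are assumed to use the first diff.
  var i = switchDays.length - 1
  while (i > 0 && days < switchDays(i)) i -= 1
  days + dayDiffs(i)
}
{code}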
[jira] [Commented] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070286#comment-17070286 ] Maxim Gekk commented on SPARK-31297: [~cloud_fan] [~hyukjin.kwon] [~dongjoon] WDYT? > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 
1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 1987-09-26T16:52:58 diff = 0 minutes > local date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes > local date-time = 1993-09-25T15:52:58 diff = 60 minutes > local date-time = 1993-09-25T16:52:
[jira] [Created] (SPARK-31297) Speed-up date-time rebasing
Maxim Gekk created SPARK-31297: -- Summary: Speed-up date-time rebasing Key: SPARK-31297 URL: https://issues.apache.org/jira/browse/SPARK-31297 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk I believe it is possible to speed up date-time rebasing by building a map from micros to the diffs between original and rebased micros, and looking the diffs up via binary search. For example, the *America/Los_Angeles* time zone has fewer than 100 points at which the diff changes: {code:scala} test("optimize rebasing") { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += MICROS_PER_HOUR } println(s"counter = $counter") } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time = 1986-09-27T16:52:58 diff = 0 minutes local date-time =
1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time = 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local date-time = 1997-10-25T15:52:58 diff = 60 minutes local date-time = 1997-10-25T16:52:58 diff =
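A minimal sketch of the lookup proposed above, assuming the switch points and diffs for one zone have been precomputed into two parallel sorted arrays (the names and shape here are illustrative, not the actual Spark implementation):
{code:scala}
import java.util.Arrays

// switches(i) is the first micros value at which diffs(i) applies;
// both arrays are sorted and have the same length.
def rebaseMicros(switches: Array[Long], diffs: Array[Long], micros: Long): Long = {
  var i = Arrays.binarySearch(switches, micros)
  // For a missing key, binarySearch returns -(insertionPoint + 1),
  // so the applicable diff is the one at the preceding switch point.
  if (i < 0) i = -i - 2
  if (i < 0) micros else micros + diffs(i)
}
{code}
With fewer than 100 switch points per time zone, the lookup costs O(log n) and avoids calendar arithmetic on every value.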
[jira] [Updated] (SPARK-31296) Benchmark date-time rebasing in Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-31296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31296: --- Summary: Benchmark date-time rebasing in Parquet datasource (was: Benchmark date-time rebasing to/from Julian calendar) > Benchmark date-time rebasing in Parquet datasource > -- > > Key: SPARK-31296 > URL: https://issues.apache.org/jira/browse/SPARK-31296 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > * Add benchmarks for saving dates/timestamps to parquet when > spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true > * Add a benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31296) Benchmark date-time rebasing to/from Julian calendar
Maxim Gekk created SPARK-31296: -- Summary: Benchmark date-time rebasing to/from Julian calendar Key: SPARK-31296 URL: https://issues.apache.org/jira/browse/SPARK-31296 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add a benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
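A rough sketch of the write path such a benchmark would measure in spark-shell; the path and the data shape are illustrative only:
{code:scala}
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
// Dates before 1582-10-15 exercise the rebasing code on write.
spark.range(10000000)
  .selectExpr("date_add(date'1001-01-01', cast(id % 365 as int)) AS d")
  .write.mode("overwrite").parquet("/tmp/rebase_date_bench")
{code}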
[jira] [Updated] (SPARK-31286) Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-31286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31286: --- Description: There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the `timeZone` option in JSON/CSV datasource, and the `tz` parameter of the from_utc_timestamp() and to_utc_timestamp() functions. was: There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the SQL config *spark.sql.session.timeZone* in the 2 forms mentioned above. > Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp > - > > Key: SPARK-31286 > URL: https://issues.apache.org/jira/browse/SPARK-31286 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > There are two distinct types of ID (see > https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): > # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the > same offset for all local date-times > # Geographical regions - an area where a specific set of rules for finding > the offset from UTC/Greenwich apply > For example, three-letter time zone IDs are ambiguous, and depend on the > locale. They have already been deprecated in the JDK, see > https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : > {code} > For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such > as "PST", "CTT", "AST") are also supported. However, their use is deprecated > because the same abbreviation is often used for multiple time zones (for > example, "CST" could be U.S. "Central Standard Time" and "China Standard > Time"), and the Java platform can then only recognize one of them.
> {code} > The ticket aims to specify formats of the `timeZone` option in JSON/CSV > datasource, and the `tz` parameter of the from_utc_timestamp() and > to_utc_timestamp() functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31286) Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
Maxim Gekk created SPARK-31286: -- Summary: Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp Key: SPARK-31286 URL: https://issues.apache.org/jira/browse/SPARK-31286 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the SQL config *spark.sql.session.timeZone* in the 2 forms mentioned above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
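The two ID types can be tried out with from_utc_timestamp in spark-shell; a region-based ID applies the DST rules of the area, while a fixed offset never changes:
{code:scala}
// Geographical region: the area's rules decide the offset.
spark.sql("SELECT from_utc_timestamp('2020-07-01 00:00:00', 'America/Los_Angeles')").show(false)
// Fixed offset: the same shift for all local date-times.
spark.sql("SELECT from_utc_timestamp('2020-07-01 00:00:00', '-08:00')").show(false)
{code}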
[jira] [Created] (SPARK-31284) Check rebasing of timestamps in ORC datasource
Maxim Gekk created SPARK-31284: -- Summary: Check rebasing of timestamps in ORC datasource Key: SPARK-31284 URL: https://issues.apache.org/jira/browse/SPARK-31284 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add tests to check that timestamps saved by Spark 2.4 are loaded back by Spark 3.0 correctly. Also add tests for timestamp rebasing in write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
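The shape of such a round-trip check in spark-shell; the path and value here are illustrative:
{code:scala}
val df = Seq(java.sql.Timestamp.valueOf("1001-01-01 01:02:03.123456")).toDF("ts")
df.write.mode("overwrite").orc("/tmp/orc_ts_rebase")
// A file written at the same path by Spark 2.4 should show the same value here.
spark.read.orc("/tmp/orc_ts_rebase").show(false)
{code}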
[jira] [Created] (SPARK-31277) Migrate `DateTimeTestUtils` from `TimeZone` to `ZoneId`
Maxim Gekk created SPARK-31277: -- Summary: Migrate `DateTimeTestUtils` from `TimeZone` to `ZoneId` Key: SPARK-31277 URL: https://issues.apache.org/jira/browse/SPARK-31277 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark SQL's date-time expressions and functions have been ported to the Java 8 time API, but tests still use the old time APIs. In particular, DateTimeTestUtils exposes functions that accept only TimeZone instances. This is inconvenient and CPU consuming because of the need to convert TimeZone instances to ZoneId instances via strings (zone ids). The ticket aims to replace the TimeZone parameters of DateTimeTestUtils functions with the ZoneId type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
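The conversion path in question, shown with the plain Java APIs:
{code:scala}
import java.time.ZoneId
import java.util.TimeZone

val tz = TimeZone.getTimeZone("America/Los_Angeles")
// The round trip via the string id that the ticket wants to avoid in tests:
val viaString: ZoneId = ZoneId.of(tz.getID)
// Constructing and passing ZoneId directly makes that conversion unnecessary:
val zid: ZoneId = ZoneId.of("America/Los_Angeles")
{code}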
[jira] [Created] (SPARK-31254) `HiveResult.toHiveString` does not use the current session time zone
Maxim Gekk created SPARK-31254: -- Summary: `HiveResult.toHiveString` does not use the current session time zone Key: SPARK-31254 URL: https://issues.apache.org/jira/browse/SPARK-31254 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, date/timestamp formatters in `HiveResult.toHiveString` are initialized once on instantiation of the `HiveResult` object, and pick up the session time zone at that point. If the session's time zone is changed, the formatters still use the previous one. See the discussion at https://github.com/apache/spark/pull/23391#discussion_r397347820 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
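A minimal sketch of the fix direction, assuming the formatter is derived from the current session time zone on every call instead of being cached at object initialization (names are illustrative, not the actual HiveResult code):
{code:scala}
import java.time.ZoneId
import java.time.format.DateTimeFormatter

// Re-reading the zone per call picks up changes to spark.sql.session.timeZone.
def timestampFormatter(sessionTimeZone: String): DateTimeFormatter =
  DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneId.of(sessionTimeZone))
{code}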
[jira] [Commented] (SPARK-31238) Incompatible ORC dates with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066427#comment-17066427 ] Maxim Gekk commented on SPARK-31238: I am working on the issue. > Incompatible ORC dates with Spark 2.4 > - > > Key: SPARK-31238 > URL: https://issues.apache.org/jira/browse/SPARK-31238 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Blocker > > Using Spark 2.4.5, write pre-1582 date to ORC file and then read it: > {noformat} > $ export TZ=UTC > $ bin/spark-shell --conf spark.sql.session.timeZone=UTC > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5-SNAPSHOT > /_/ > > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_161) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sql("select cast('1200-01-01' as date) > dt").write.mode("overwrite").orc("/tmp/datefile") > scala> spark.read.orc("/tmp/datefile").show > +--+ > |dt| > +--+ > |1200-01-01| > +--+ > scala> :quit > {noformat} > Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file: > {noformat} > $ export TZ=UTC > $ bin/spark-shell --conf spark.sql.session.timeZone=UTC > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_161) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.read.orc("/tmp/datefile").show > +--+ > |dt| > +--+ > |1200-01-08| > +--+ > scala> > {noformat} > Dates are off. > Timestamps, on the other hand, appear to work as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31237) Replace 3-letter time zones by zone offsets
Maxim Gekk created SPARK-31237: -- Summary: Replace 3-letter time zones by zone offsets Key: SPARK-31237 URL: https://issues.apache.org/jira/browse/SPARK-31237 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk 3-letter time zones are ambiguous, and have already been deprecated in the JDK, see [https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html] . Also, some short names are mapped to region-based zone IDs, and don't conform to actual definitions. For example, the PST short name is mapped to America/Los_Angeles. It has different zone offsets in the Java 7 and Java 8 APIs: {code:scala} scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-05 23:00:00").getTime)/3600000.0 res11: Double = -7.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 00:00:00").getTime)/3600000.0 res12: Double = -7.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 01:00:00").getTime)/3600000.0 res13: Double = -8.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 02:00:00").getTime)/3600000.0 res14: Double = -8.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 03:00:00").getTime)/3600000.0 res15: Double = -8.0 {code} and in the Java 8 API https://github.com/apache/spark/pull/27980#discussion_r396287278 By definition, PST must be a constant equal to UTC-08:00, see https://www.timeanddate.com/time/zones/pst The ticket aims to replace all short time zone names with zone offsets in tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
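The ambiguity is easy to demonstrate with java.time: the short ID is silently resolved to a region with DST rules, while a fixed offset is constant by construction:
{code:scala}
import java.time.ZoneId

// The deprecated short ID resolves to a region, not a constant offset:
ZoneId.of("PST", ZoneId.SHORT_IDS) // America/Los_Angeles
// The unambiguous replacement proposed for tests:
ZoneId.of("-08:00")
{code}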
[jira] [Created] (SPARK-31232) Specify formats of `spark.sql.session.timeZone`
Maxim Gekk created SPARK-31232: -- Summary: Specify formats of `spark.sql.session.timeZone` Key: SPARK-31232 URL: https://issues.apache.org/jira/browse/SPARK-31232 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Maxim Gekk There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the SQL config *spark.sql.session.timeZone* in the 2 forms mentioned above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
[ https://issues.apache.org/jira/browse/SPARK-31212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064550#comment-17064550 ] Maxim Gekk commented on SPARK-31212: I think it would be better to use isLeapYear of GregorianCalendar, [https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html#isLeapYear(int)] There are other suspicious functions that need to be reviewed: [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L608-L610] > Failure of casting the '1000-02-29' string to the date type > --- > > Key: SPARK-31212 > URL: https://issues.apache.org/jira/browse/SPARK-31212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > '1000-02-29' is a valid date in the Julian calendar, used in Spark 2.4.5 for > dates before 1582-10-15, but casting the string to the date type fails: > {code:scala} > scala> val df = > Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > ++ > |date| > ++ > |null| > ++ > {code} > Creating a dataset from java.sql.Date w/ the same input string works > correctly: > {code:scala} > scala> val df2 = > Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) > df2: org.apache.spark.sql.DataFrame = [date: date] > scala> df2.show > +--+ > | date| > +--+ > |1000-02-29| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
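The difference between the two leap-year rules can be checked directly: GregorianCalendar applies the Julian rule before the 1582 cutover, while java.time uses the proleptic Gregorian rule:
{code:scala}
import java.time.Year
import java.util.GregorianCalendar

new GregorianCalendar().isLeapYear(1000) // true: Julian rule, every 4th year is leap
Year.isLeap(1000)                        // false: proleptic Gregorian rule
{code}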
[jira] [Created] (SPARK-31221) Rebase all dates/timestamps in conversion in Java types
Maxim Gekk created SPARK-31221: -- Summary: Rebase all dates/timestamps in conversion in Java types Key: SPARK-31221 URL: https://issues.apache.org/jira/browse/SPARK-31221 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the fromJavaDate(), toJavaDate(), toJavaTimestamp() and fromJavaTimestamp() methods of DateTimeUtils perform rebasing only for dates before the Gregorian cutover date 1582-10-15, assuming that the Gregorian calendar behaves the same in the Java 7 and Java 8 APIs. The assumption is incorrect, in particular in getting zone offsets, for instance: {code:scala} scala> java.time.ZoneId.systemDefault res16: java.time.ZoneId = America/Los_Angeles scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0 warning: there was one deprecation warning; re-run with -deprecation for details res17: Double = 8.0 scala> java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00")) res18: java.time.ZoneOffset = -07:52:58 {code} The Java 7 API is not accurate: America/Los_Angeles changed its time zone offset from {code} -7:52:58 {code} to {code} -8:00 {code} The ticket aims to perform rebasing for any dates/timestamps independently of the calendar cutover date. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064376#comment-17064376 ] Maxim Gekk commented on SPARK-31183: [~koert] The problem will be resolved soon, see https://github.com/apache/spark/pull/27964#issuecomment-602152201 > Incompatible Avro dates/timestamps with Spark 2.4 > - > > Key: SPARK-31183 > URL: https://issues.apache.org/jira/browse/SPARK-31183 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Write dates/timestamps to Avro file in Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 > {code} > {code:scala} > scala> > df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |date | > +--+ > |1001-01-01| > +--+ > scala> > df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-01 01:02:03.123456| > +--+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from > Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 > {code} > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) > +--+ > |date | > +--+ > |1001-01-07| > +--+ > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-07 01:09:05.123456| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
[ https://issues.apache.org/jira/browse/SPARK-31212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064017#comment-17064017 ] Maxim Gekk commented on SPARK-31212: The isLeapYear() function in 2.4 assumes the Proleptic Gregorian calendar: https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L600-L602 but Spark 2.4 is actually based on the hybrid Julian+Gregorian calendar, as we can see at https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L513-L517 It means the following functions in DateTimeUtils return incorrect results for dates before the Gregorian cutover date: # getQuarter # splitDate # getMonth # getDayOfMonth # firstDayOfMonth # dateAddMonths # stringToTimestamp # stringToDate # monthsBetween # getLastDayOfMonth /cc [~cloud_fan] [~hyukjin.kwon] > Failure of casting the '1000-02-29' string to the date type > --- > > Key: SPARK-31212 > URL: https://issues.apache.org/jira/browse/SPARK-31212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > '1000-02-29' is a valid date in the Julian calendar, used in Spark 2.4.5 for > dates before 1582-10-15, but casting the string to the date type fails: > {code:scala} > scala> val df = > Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > ++ > |date| > ++ > |null| > ++ > {code} > Creating a dataset from java.sql.Date w/ the same input string works > correctly: > {code:scala} > scala> val df2 = > Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) > df2: org.apache.spark.sql.DataFrame = [date: date] > scala> df2.show > +--+ > | date| > +--+ > |1000-02-29| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
Maxim Gekk created SPARK-31212: -- Summary: Failure of casting the '1000-02-29' string to the date type Key: SPARK-31212 URL: https://issues.apache.org/jira/browse/SPARK-31212 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Maxim Gekk '1000-02-29' is a valid date in the Julian calendar, used in Spark 2.4.5 for dates before 1582-10-15, but casting the string to the date type fails: {code:scala} scala> val df = Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.show ++ |date| ++ |null| ++ {code} Creating a dataset from java.sql.Date w/ the same input string works correctly: {code:scala} scala> val df2 = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) df2: org.apache.spark.sql.DataFrame = [date: date] scala> df2.show +--+ | date| +--+ |1000-02-29| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31211) Failure on loading 1000-02-29 from parquet saved by Spark 2.4.5
Maxim Gekk created SPARK-31211: -- Summary: Failure on loading 1000-02-29 from parquet saved by Spark 2.4.5 Key: SPARK-31211 URL: https://issues.apache.org/jira/browse/SPARK-31211 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Save a valid date in the Julian calendar with Spark 2.4.5 in a leap year, for instance 1000-02-29: {code} $ export TZ="America/Los_Angeles" {code} {code:scala} scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.show +--+ | date| +--+ |1000-02-29| +--+ scala> df.write.mode("overwrite").format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro_leap") scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap") {code} Load the parquet files back with Spark 3.1.0-SNAPSHOT: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show +--+ | date| +--+ |1000-03-06| +--+ scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show 20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year at java.time.LocalDate.create(LocalDate.java:429) at java.time.LocalDate.of(LocalDate.java:269) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.rebaseJulianToGregorianDays(DateTimeUtils.scala:1008) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable
Maxim Gekk created SPARK-31195: -- Summary: Reuse days rebase functions of DateTimeUtils in DaysWritable Key: SPARK-31195 URL: https://issues.apache.org/jira/browse/SPARK-31195 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays() were added by the PR https://github.com/apache/spark/pull/27915. The ticket aims to replace similar code in org.apache.spark.sql.hive.DaysWritable with these functions in order to: # deduplicate code # reuse functions that are better tested and cross-checked by reading parquet files saved by Spark 2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061902#comment-17061902 ] Maxim Gekk commented on SPARK-31183: I am working on the issue. > Incompatible Avro dates/timestamps with Spark 2.4 > - > > Key: SPARK-31183 > URL: https://issues.apache.org/jira/browse/SPARK-31183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates/timestamps to Avro file in Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 > {code} > {code:scala} > scala> > df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |date | > +--+ > |1001-01-01| > +--+ > scala> > df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-01 01:02:03.123456| > +--+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from > Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 > {code} > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) > +--+ > |date | > +--+ > |1001-01-07| > +--+ > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-07 01:09:05.123456| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061903#comment-17061903 ] Maxim Gekk commented on SPARK-31183: [~cloud_fan] FYI > Incompatible Avro dates/timestamps with Spark 2.4 > - > > Key: SPARK-31183 > URL: https://issues.apache.org/jira/browse/SPARK-31183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates/timestamps to Avro file in Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 > {code} > {code:scala} > scala> > df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |date | > +--+ > |1001-01-01| > +--+ > scala> > df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-01 01:02:03.123456| > +--+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from > Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 > {code} > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) > +--+ > |date | > +--+ > |1001-01-07| > +--+ > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-07 01:09:05.123456| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
Maxim Gekk created SPARK-31183: -- Summary: Incompatible Avro dates/timestamps with Spark 2.4 Key: SPARK-31183 URL: https://issues.apache.org/jira/browse/SPARK-31183 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Write dates/timestamps to Avro file in Spark 2.4.5: {code} $ export TZ="America/Los_Angeles" $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 {code} {code:scala} scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) +--+ |date | +--+ |1001-01-01| +--+ scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) +--+ |ts| +--+ |1001-01-01 01:02:03.123456| +--+ {code} Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from Spark 2.4.5: {code} $ export TZ="America/Los_Angeles" $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 {code} {code:scala} scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) +--+ |date | +--+ |1001-01-07| +--+ scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) +--+ |ts| +--+ |1001-01-07 01:09:05.123456| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059617#comment-17059617 ] Maxim Gekk commented on SPARK-31159: [~cloud_fan] FYI > Incompatible Parquet dates/timestamps with Spark 2.4 > > > Key: SPARK-31159 > URL: https://issues.apache.org/jira/browse/SPARK-31159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates/timestamps to Parquet file in Spark 2.4: > {code} > $ export TZ="UTC" > $ ~/spark-2.4/bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_231) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.conf.set("spark.sql.session.timeZone", "UTC") > scala> val df = Seq(("1001-01-01", "1001-01-01 > 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), > $"tsS".cast("timestamp").as("ts")) > df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp] > scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") > scala> spark.conf.set("spark.sql.parquet.outputTimestampType", > "TIMESTAMP_MICROS") > scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") > scala> > spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) > +--+--+ > |d |ts| > +--+--+ > |1001-01-01|1001-01-01 01:02:03.123456| > +--+--+ > {code} > Spark 2.4 saves dates/timestamps in Julian calendar. The parquet-mr tool > prints *1001-01-07* and *1001-01-07T01:02:03.123456+*: > {code} > $ java -jar > /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar > dump -m > ./2_4_5_micros/part-0-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet > INT32 d > > *** row group 1 of 1, values 1 to 1 *** > value 1: R:0 D:1 V:1001-01-07 > INT64 ts > > *** row group 1 of 1, values 1 to 1 *** > value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) prints the same as parquet-mr but > different values from Spark 2.4: > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-preview2 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_231) > scala> > spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) > +--+--+ > |d |ts| > +--+--+ > |1001-01-07|1001-01-07 01:02:03.123456| > +--+--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4
Maxim Gekk created SPARK-31159: -- Summary: Incompatible Parquet dates/timestamps with Spark 2.4 Key: SPARK-31159 URL: https://issues.apache.org/jira/browse/SPARK-31159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Write dates/timestamps to Parquet file in Spark 2.4: {code} $ export TZ="UTC" $ ~/spark-2.4/bin/spark-shell Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.5 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.session.timeZone", "UTC") scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp] scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) +--+--+ |d |ts| +--+--+ |1001-01-01|1001-01-01 01:02:03.123456| +--+--+ {code} Spark 2.4 saves dates/timestamps in Julian calendar. The parquet-mr tool prints *1001-01-07* and *1001-01-07T01:02:03.123456+*: {code} $ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-0-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet INT32 d *** row group 1 of 1, values 1 to 1 *** value 1: R:0 D:1 V:1001-01-07 INT64 ts *** row group 1 of 1, values 1 to 1 *** value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+ {code} Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) prints the same as parquet-mr but different values from Spark 2.4: {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-preview2 /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) +--+--+ |d |ts| +--+--+ |1001-01-07|1001-01-07 01:02:03.123456| +--+--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057619#comment-17057619 ] Maxim Gekk commented on SPARK-30565: Per [~dongjoon], the default ORC reader doesn't fully cover the functionality of the Hive ORC reader, so some users may still have to use the Hive reader in some cases. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show a regression of ~3 times. > Before: > {code} > Hive built-in ORC 520531 >8 2.0 495.8 0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC 1761 1792 > 43 0.61679.3 0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
Maxim Gekk created SPARK-31076: -- Summary: Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time Key: SPARK-31076 URL: https://issues.apache.org/jira/browse/SPARK-31076 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk By default, collect() returns java.sql.Timestamp/Date instances with offsets derived from internal values of Catalyst's TIMESTAMP/DATE that store microseconds since the epoch. The conversion from internal values to java.sql.Timestamp/Date is based on the Proleptic Gregorian calendar, but converting the resulting values before the year 1582 to strings produces timestamp/date strings in the Julian calendar. For example: {code} scala> sql("select date '1100-10-10'").collect() res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) {code} This can be fixed if Catalyst's internal values are converted to a local date-time in the Gregorian calendar, and a local date-time is constructed from the resulting year, month, ..., seconds in the Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
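A sketch of the proposed conversion for timestamps (not the actual Spark code): interpret the internal micros in the Proleptic Gregorian calendar, then rebuild java.sql.Timestamp field by field, so the hybrid-calendar Timestamp renders the same local date-time string:
{code:scala}
import java.sql.Timestamp
import java.time.{Instant, LocalDateTime, ZoneId}

def microsToTimestamp(micros: Long, zone: ZoneId): Timestamp = {
  // Local date-time of the internal value in the Proleptic Gregorian calendar.
  val instant = Instant.ofEpochSecond(
    Math.floorDiv(micros, 1000000L), Math.floorMod(micros, 1000000L) * 1000)
  val ldt = LocalDateTime.ofInstant(instant, zone)
  // Timestamp.valueOf interprets the same fields in the hybrid Julian+Gregorian calendar.
  Timestamp.valueOf(ldt)
}
{code}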
[jira] [Created] (SPARK-31044) Support foldable input by `schema_of_json`
Maxim Gekk created SPARK-31044: -- Summary: Support foldable input by `schema_of_json` Key: SPARK-31044 URL: https://issues.apache.org/jira/browse/SPARK-31044 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the `schema_of_json()` function allows only a string literal as the input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
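What the change would allow, sketched in spark-shell; the second call fails before the change because its argument is foldable but not a literal:
{code:scala}
// Accepted today: a string literal.
spark.sql("""SELECT schema_of_json('{"id": 1, "city": "Moscow"}')""").show(false)
// Proposed: any foldable string expression.
spark.sql("""SELECT schema_of_json(concat('{"id": 1,', ' "city": "Moscow"}'))""").show(false)
{code}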
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051076#comment-17051076 ] Maxim Gekk commented on SPARK-30563: [~petertoth] If you think it is possible to avoid some overhead of the NoOp datasource, please open a PR. > Regressions in Join benchmarks > -- > > Key: SPARK-30563 > URL: https://issues.apache.org/jira/browse/SPARK-30563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Regenerated benchmark results in the > https://github.com/apache/spark/pull/27078 show many regressions in > JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see > old results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10 > new results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10 > One of the differences is the use of the `NoOp` datasource in the new > queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051073#comment-17051073 ] Maxim Gekk commented on SPARK-30563: > we spend a lot of time in this loop even The loop just forces materialization of the joined rows. With df.groupBy().count(), you skip some steps of the join, it seems. I think in most cases users need the results of the join, not just a count on top of it. > Regressions in Join benchmarks > -- > > Key: SPARK-30563 > URL: https://issues.apache.org/jira/browse/SPARK-30563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Regenerated benchmark results in the > https://github.com/apache/spark/pull/27078 show many regressions in > JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see > old results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10 > new results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10 > One of the differences is the use of the `NoOp` datasource in the new > queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
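For reference, the `NoOp` datasource used by the new benchmark queries forces full materialization of the result without writing anything, e.g.:
{code:scala}
val joined = spark.range(10000000).join(spark.range(10000000), "id")
// Materializes every joined row; no files are produced.
joined.write.format("noop").mode("overwrite").save()
{code}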
[jira] [Created] (SPARK-31025) Support foldable input by `schema_of_csv`
Maxim Gekk created SPARK-31025: -- Summary: Support foldable input by `schema_of_csv` Key: SPARK-31025 URL: https://issues.apache.org/jira/browse/SPARK-31025 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the `schema_of_csv()` function allows only a string literal as the input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31020) Support foldable schemas by `from_csv`
[ https://issues.apache.org/jira/browse/SPARK-31020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31020: --- Description: Currently, Spark accepts only literals or schema_of_csv w/ literal input as the schema parameter of from_csv. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There are no reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_csv(). was: Currently, Spark accepts only literals or schema_of_csv w/ literal input as the schema parameter of from_csv. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There is reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_csv(). > Support foldable schemas by `from_csv` > -- > > Key: SPARK-31020 > URL: https://issues.apache.org/jira/browse/SPARK-31020 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark accepts only literals or schema_of_csv w/ literal input as > the schema parameter of from_csv. And it fails on any foldable expressions, > for instance: > {code:sql} > spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city > STRING', 'dpt_org_', '')); > Error in query: Schema should be specified in DDL format as a string literal > or output of the schema_of_csv function instead of replace('dpt_org_id INT, > dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 > {code} > There are no reasons to restrict users by literals. The ticket aims to > support any foldable schemas by from_csv(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31023) Support foldable schemas by `from_json`
Maxim Gekk created SPARK-31023: -- Summary: Support foldable schemas by `from_json` Key: SPARK-31023 URL: https://issues.apache.org/jira/browse/SPARK-31023 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark accepts only literals or schema_of_json w/ literal input as the schema parameter of from_json. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There are no reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_json(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31020) Support foldable schemas by `from_csv`
Maxim Gekk created SPARK-31020: -- Summary: Support foldable schemas by `from_csv` Key: SPARK-31020 URL: https://issues.apache.org/jira/browse/SPARK-31020 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark accepts only literals or schema_of_csv w/ literal input as the schema parameter of from_csv. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There are no reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_csv(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31005) Support time zone ids in casting strings to timestamps
Maxim Gekk created SPARK-31005: -- Summary: Support time zone ids in casting strings to timestamps Key: SPARK-31005 URL: https://issues.apache.org/jira/browse/SPARK-31005 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark supports only time zone offsets in the formats: * -[h]h:[m]m * +[h]h:[m]m * Z The ticket aims to support any valid time zone ids at the end of timestamp strings, for instance: {code} 2015-03-18T12:03:17.123456 Europe/Moscow {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
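The proposed behavior, sketched in spark-shell; before the change the second cast returns null because only offsets and Z are recognized:
{code:scala}
// Supported today: a zone offset at the end of the string.
spark.sql("SELECT cast('2015-03-18T12:03:17.123456+03:00' AS timestamp)").show(false)
// Proposed: a region-based time zone id at the end of the string.
spark.sql("SELECT cast('2015-03-18T12:03:17.123456 Europe/Moscow' AS timestamp)").show(false)
{code}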
[jira] [Created] (SPARK-30988) Add more edge-case exercising values to stats tests
Maxim Gekk created SPARK-30988: -- Summary: Add more edge-case exercising values to stats tests Key: SPARK-30988 URL: https://issues.apache.org/jira/browse/SPARK-30988 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add more edge-cases to StatisticsCollectionTestBase -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds
Maxim Gekk created SPARK-30925: -- Summary: Overflow/round errors in conversions of milliseconds to/from microseconds Key: SPARK-30925 URL: https://issues.apache.org/jira/browse/SPARK-30925 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Spark has special methods in DateTimeUtils for converting microseconds from/to milliseconds - `fromMillis()` and `toMillis()`. The methods handle arithmetic overflow and correctly round negative values. The ticket aims to review all places in Spark SQL where microseconds are converted from/to milliseconds, and replace them with the util methods from DateTimeUtils. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
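To illustrate the rounding pitfall the util methods avoid (a standalone sketch, not Spark source):
{code:scala}
// Naive integer division truncates toward zero, so negative microsecond
// values land in the wrong millisecond bucket.
val micros = -1001L                        // 1.001 ms before the epoch
val naive  = micros / 1000L                // -1: truncated toward zero
val floor  = Math.floorDiv(micros, 1000L)  // -2: floored, the correct bucket
// In the other direction, Math.multiplyExact(millis, 1000L) surfaces
// arithmetic overflow instead of silently wrapping around.
{code}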
[jira] [Commented] (SPARK-30894) The behavior of Size function should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041284#comment-17041284 ] Maxim Gekk commented on SPARK-30894: I am working on it. > The behavior of Size function should not depend on SQLConf.get > -- > > Key: SPARK-30894 > URL: https://issues.apache.org/jira/browse/SPARK-30894 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30892) Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
[ https://issues.apache.org/jira/browse/SPARK-30892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30892: --- Description: The spark.sql.variable.substitute.depth SQL config has not been used since Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for configs that could impact behavior, but not for this particular config. Raising such an exception will just make migration to Spark 3.0 more difficult. (was: The spark.sql.variable.substitute.depth SQL config has not been used since Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for configs that could impact behavior, but not for this particular config. Raising such an exception will just make migration to Spark more difficult.) > Exclude spark.sql.variable.substitute.depth from removedSQLConfigs > -- > > Key: SPARK-30892 > URL: https://issues.apache.org/jira/browse/SPARK-30892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > The spark.sql.variable.substitute.depth SQL config has not been used since > Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the > config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user > sets it to a non-default value (1, for example), they will get an exception. > That is acceptable for configs that could impact behavior, but not for this > particular config. Raising such an exception will just make migration to > Spark 3.0 more difficult. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30892) Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
Maxim Gekk created SPARK-30892: -- Summary: Exclude spark.sql.variable.substitute.depth from removedSQLConfigs Key: SPARK-30892 URL: https://issues.apache.org/jira/browse/SPARK-30892 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The spark.sql.variable.substitute.depth SQL config has not been used since Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for configs that could impact behavior, but not for this particular config. Raising such an exception will just make migration to Spark more difficult. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
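A sketch of the failure mode described above (the exact exception message is hypothetical):
{code:scala}
// Once a key is listed in SQLConf.removedSQLConfigs, setting it to anything
// other than the recorded default fails instead of being silently ignored.
spark.conf.set("spark.sql.variable.substitute.depth", 1)
// org.apache.spark.sql.AnalysisException: The SQL config
// 'spark.sql.variable.substitute.depth' was removed ... (message approximate)
{code}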
[jira] [Comment Edited] (SPARK-30858) IntegralDivide's dataType should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039433#comment-17039433 ] Maxim Gekk edited comment on SPARK-30858 at 2/18/20 8:29 PM: - The *div* function binds on this particular expression [https://github.com/apache/spark/blob/919d551ddbf7575abe7fe47d4bbba62164d6d845/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L282] . I am not sure that we can replace it during analysis. was (Author: maxgekk): The *div* function binds on this particular expressions [https://github.com/apache/spark/blob/919d551ddbf7575abe7fe47d4bbba62164d6d845/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L282] . I am not sure that we can replace it during analysis. > IntegralDivide's dataType should not depend on SQLConf.get > -- > > Key: SPARK-30858 > URL: https://issues.apache.org/jira/browse/SPARK-30858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Blocker > > {{IntegralDivide}}'s dataType depends on the value of > {{SQLConf.get.integralDivideReturnLong}}. This is a problem because the > configuration can change between different phases of planning, and this can > silently break a query plan which can lead to crashes or data corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30858) IntegralDivide's dataType should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039433#comment-17039433 ] Maxim Gekk commented on SPARK-30858: The *div* function binds on this particular expressions [https://github.com/apache/spark/blob/919d551ddbf7575abe7fe47d4bbba62164d6d845/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L282] . I am not sure that we can replace it during analysis. > IntegralDivide's dataType should not depend on SQLConf.get > -- > > Key: SPARK-30858 > URL: https://issues.apache.org/jira/browse/SPARK-30858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Blocker > > {{IntegralDivide}}'s dataType depends on the value of > {{SQLConf.get.integralDivideReturnLong}}. This is a problem because the > configuration can change between different phases of planning, and this can > silently break a query plan which can lead to crashes or data corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30858) IntegralDivide's dataType should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039410#comment-17039410 ] Maxim Gekk commented on SPARK-30858: > This is a problem because the configuration can change between different > phases of planning [~hvanhovell] Is the code below the right solution to the problem?
{code:scala}
case class IntegralDivide(
    left: Expression,
    right: Expression,
    integralDivideReturnLong: Boolean) extends DivModLike {

  // Capture the config value once, at construction time, so the data type
  // cannot change between planning phases.
  def this(left: Expression, right: Expression) = {
    this(left, right, SQLConf.get.integralDivideReturnLong)
  }

  override def dataType: DataType = if (integralDivideReturnLong) {
    LongType
  } else {
    left.dataType
  }
}
{code}
> IntegralDivide's dataType should not depend on SQLConf.get > -- > > Key: SPARK-30858 > URL: https://issues.apache.org/jira/browse/SPARK-30858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Blocker > > {{IntegralDivide}}'s dataType depends on the value of > {{SQLConf.get.integralDivideReturnLong}}. This is a problem because the > configuration can change between different phases of planning, and this can > silently break a query plan which can lead to crashes or data corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30869) Convert dates to/from timestamps in microseconds precision
Maxim Gekk created SPARK-30869: -- Summary: Convert dates to/from timestamps in microseconds precision Key: SPARK-30869 URL: https://issues.apache.org/jira/browse/SPARK-30869 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark converts dates to/from "timestamp" in millisecond precision, but internally Catalyst's TimestampType values are stored as microseconds since the epoch. When such a conversion is needed in other date-timestamp functions like DateTimeUtils.monthsBetween, the function has to convert microseconds to milliseconds and then to days, see https://github.com/apache/spark/blob/06217cfded8d32962e7c54c315f8e684eb9f0999/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L577-L580 which just brings additional overhead without any benefit. In earlier versions, this made sense because milliseconds could be passed to TimeZone.getOffset, but Spark has since switched to the Java 8 time API and ZoneId, so conversions to milliseconds are no longer needed. The ticket aims to replace millisToDays with microsToDays, and daysToMillis with daysToMicros in DateTimeUtils. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
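A minimal sketch of the proposed direct micros-to-days conversion (assumed shape; the actual DateTimeUtils helper may differ):
{code:scala}
import java.time.{Instant, ZoneId}

// Convert microseconds since the epoch straight to days since the epoch in
// the given time zone, skipping the intermediate millisecond step.
def microsToDays(micros: Long, zoneId: ZoneId): Int = {
  val instant = Instant.ofEpochSecond(
    Math.floorDiv(micros, 1000000L),          // whole seconds, floored
    Math.floorMod(micros, 1000000L) * 1000L)  // remainder as nanoseconds
  instant.atZone(zoneId).toLocalDate.toEpochDay.toInt
}
{code}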
[jira] [Created] (SPARK-30865) Refactor DateTimeUtils
Maxim Gekk created SPARK-30865: -- Summary: Refactor DateTimeUtils Key: SPARK-30865 URL: https://issues.apache.org/jira/browse/SPARK-30865 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk
* Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
* Remove TimeZoneGMT because it is equal to UTC
* Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
* Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType, as sketched below

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
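A sketch of the aliasing step (placement in DateTimeUtils assumed; the values reflect Catalyst's internal representations):
{code:scala}
// DateType is stored as days since 1970-01-01, TimestampType as
// microseconds since 1970-01-01T00:00:00Z.
type SQLDate = Int
type SQLTimestamp = Long
{code}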
[jira] [Updated] (SPARK-30857) Wrong truncations of timestamps before the epoch to hours and days
[ https://issues.apache.org/jira/browse/SPARK-30857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30857: --- Description: Truncations to hours of timestamps after the epoch are correct:
{code:sql}
spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), date_trunc('HOUR', '2020-02-11 00:01:02.789');
2020-02-11 00:00:00 2020-02-11 00:00:00
{code}
but truncations of timestamps before the epoch are incorrect:
{code:sql}
spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), date_trunc('HOUR', '1960-02-11 00:01:02.789');
1960-02-11 01:00:00 1960-02-11 01:00:00
{code}
The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00*
The same holds at the DAY level:
{code:sql}
spark-sql> select date_trunc('DAY', '1960-02-11 00:01:02.123'), date_trunc('DAY', '1960-02-11 00:01:02.789');
1960-02-12 00:00:00 1960-02-12 00:00:00
{code}
The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00*
was: Truncations to hours of timestamps after the epoch are correct: {code:sql} spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), date_trunc('HOUR', '2020-02-11 00:01:02.789'); 2020-02-11 00:00:00 2020-02-11 00:00:00 {code} but truncations of timestamps before the epoch are incorrect: {code:sql} spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), date_trunc('HOUR', '1960-02-11 00:01:02.789'); 1960-02-11 01:00:00 1960-02-11 01:00:00 {code} The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00* The same holds at the DAY level: {code:sql} spark-sql> select date_trunc('DAY', '1960-02-11 00:01:02.123'), date_trunc('DAY', '1960-02-11 00:01:02.789'); 1960-02-12 00:00:00 1960-02-12 00:00:00 {code} The result must be 1960-02-11 00:00:00 1960-02-11 00:00:00 > Wrong truncations of timestamps before the epoch to hours and days > -- > > Key: SPARK-30857 > URL: https://issues.apache.org/jira/browse/SPARK-30857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > Truncations to hours of timestamps after the epoch are correct: > {code:sql} > spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), > date_trunc('HOUR', '2020-02-11 00:01:02.789'); > 2020-02-11 00:00:00 2020-02-11 00:00:00 > {code} > but truncations of timestamps before the epoch are incorrect: > {code:sql} > spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), > date_trunc('HOUR', '1960-02-11 00:01:02.789'); > 1960-02-11 01:00:00 1960-02-11 01:00:00 > {code} > The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00* > The same holds at the DAY level: > {code:sql} > spark-sql> select date_trunc('DAY', '1960-02-11 00:01:02.123'), > date_trunc('DAY', '1960-02-11 00:01:02.789'); > 1960-02-12 00:00:00 1960-02-12 00:00:00 > {code} > The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org