[jira] [Created] (SPARK-31669) RowEncoderSuite.encode/decode fails on 1000-02-29
Maxim Gekk created SPARK-31669: -- Summary: RowEncoderSuite.encode/decode fails on 1000-02-29 Key: SPARK-31669 URL: https://issues.apache.org/jira/browse/SPARK-31669 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Here is the failure https://github.com/apache/spark/pull/28481#issuecomment-626034381: {code} org.scalatest.exceptions.TestFailedException: schema: struct input: [null,true,127,-32768,684257610,3148440411416190456,Infinity,null,2.036236359763072870,ം뵡所碆ᚯ᧳ꒁ밯झ᧱휽⑲岫遳翁㎊륣䓵읹씶읽Â␣⪸붵끂꩖⟭䶄裻乌⇇깍뵙偁뷩셙녶퐾귘嫫䍧쩔ꆁ䠾ՠ訣췐つ亙⚓깠긄蚣꿞묌泓㘡ⵆ橾櫻膋뿽⮎㖍杘䊣臼穇붘켑镅抎灕쿿ァ쏍㤰酀旬槳鑻槸놛턌춅ꉪ陪⡉法耸郄篍㹏吡ط汢측䱣 莶婚ⳟ슿쓻̷흖〦湶ဎ銓霁叹롄ᯕ珅䅃卩慗銁묠쯟ሄ啕澻矌軈憃䑋餤I쒚ᡭ⪩⚋湐蒒ジ䝱綅媪㍉芸礮猱耳藁笲⽽壶젅溜穸⫾룚྿뇳Ѩ䍢넪谦⎠줊넳楼橨䖊ꪗꚔ鬜⋍羯ሾ삦毜뢍⛛᭟莽糸픣좖뮋撜혍牭ӎ뢂験ꆪᩉ跙㌌ᔸꦐ〷旽k텁ଘ쩧媉❛뛽뷺㱂᪭挃ቿ셾⁞邞郰홋쀘ᜍ뉿ഁ迭梽ዳ硟崤쑼놱뎬蓬覄挗뾱뉍枈懂⼞ܭ갸첟ᢍ燃Ò䦛∫㦿ᶡ랗ђ䓸쑾퀷ၓ鍖霃솄⨗얔嫚ꨵ캁큰ߢ䌵Ⓡ扛郾ꌟ䫀㑈瓺냾厌ᇗ玹띏푏㻛䰁ᤠ邰굇뷉恃ᦜ쾖戀諕돚裹聼鬽劙Ἱ䏐烗䢭뉁ꏼⴾ欆⛺坶磩̿꽦⾩綬跩玉谩嶂퇗떾心鵈짘쒸봐傱䦂殏┗ח듅宥㣠ꘙ㽟忽ଟ겚鄀梧ж䋁癫剫㠉繮੫ݽ櫌非剖䤖噹뫒圏쬧罍氒ញ梶印䶋蝗杨윇鬑䰡笤㜇梀큦먚碈蝠⒊쩔蹂ૡወ쵩襒ᇳ擴ꓙ踘짧㤫倍趯鱘剨궐ঔ⇮ᶄಂ꼉⨛插柬ᒠ뇯뉒Ⱛ돝ヘ枀冗ꈑ筚綃놪㞴滅䷀ቿ䋃絚孝⏍ɞԃ灚诔懠卮쬸υ뇺闭䆲葞颫頴渋皒夂Ⲧ蟹폊綘ꥄ悈匢觏奴둇⺮웧쭑析윘ⴉ㯒罧䔫妬滢顂⺀ᶠ洷㈋祵鼲꿓阤煳⪧耒襠Ὀ蒣尥鴴涜⭕넳ⶬꑷ㢭憾휦蘀暫줔䐳Ӏ膲뜊꾓휔⤻染肽㉟Ὲ돋⏦턝⨋噴䡧☘蟾違숶籩헺Ꮼ͵ळ⣣ੲ憋ጴ癤Ś泣ࢨ뎜뚗꽳텭⽊ꦞ⍝臬슯챑捒ᐑ薯闌巡猰恝ᘱ퇨倶掫ύ矞㹿䱟᭵ජꞥ푥✠儦慮齵ệ艝傫⠤⾿챔លͬ츂궄裐편ἵ핗곐촂Ѷ鋟ໄ櫷諩艄掽᧡輎ଇ颁굺㒔企鲺脞稯흂휾ꆲ駊㲹恾暤ź沥咺ଅᖣ嶀㱎쐢꼕㮚ⴞĒ䯭튔㹶ꯜꇙ廦㏚颿垌빫ࠣ悰흥꧆괱鈋暶ᭇ燙㐇뿜閆䩋쾽䉄ៈᵅ칇ચ厑济갺캜㤩봫껫衴㎱롺藪夞䃡㮛픳餣최껐ꮾꃼ友Ῡ磗༩ꡐ흏崋䰖牀㨊䞋ᓊ㺧ᔣꥱ룛ᚁ爥呯ᩮၥ㴳㗀籧鮶噿浦ٰ癝牻⬬䷗㽂醙ꨞੇ굾鏬⑰酚곥ℰ菁εⓤ嶐媒帊녲湙犉ܒ啹⾧孨䜸錸ஊ쐡ᾫ㊮夒䇏繍힂ᡗ奄輽섚肫쀺왗隬㨖ⲝⵙ껽狇貥෫孒톶鄜趿滃逅ꨎ䫻箚美뮣湾梠贉遚줐㞻䴳떛齿楂ᣀ䟯再ꨬ驂䉭ꇜ,[B@3f1a4861,1970-01-01,1000-02-29 10:11:12.123,null] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31665) Test parquet dictionary encoding of random dates/timestamps
Maxim Gekk created SPARK-31665: -- Summary: Test parquet dictionary encoding of random dates/timestamps Key: SPARK-31665 URL: https://issues.apache.org/jira/browse/SPARK-31665 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Currently, dictionary encoding is not tested in the ParquetHadoopFsRelationSuite test "test all data types" because the generated dates and timestamps are uniformly distributed, so dictionary encoding is in fact never applied to those types. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
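A minimal sketch of how such a test could force dictionary encoding (spark-shell style; the value pool, row count, and output path are illustrative, not the suite's actual code): sampling dates from a few distinct values makes the Parquet writer emit a dictionary page, unlike the uniformly distributed values generated today.
{code:scala}
// Draw random dates from a small pool of distinct values instead of a
// uniform distribution, so the Parquet writer builds a dictionary page.
import java.sql.Date
import scala.util.Random

val pool = Seq("1000-01-01", "1582-10-15", "1970-01-01", "2020-05-01").map(Date.valueOf)
val dates = Seq.fill(1000)(pool(Random.nextInt(pool.length)))
dates.toDF("date")
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/tmp/parquet-date-dict-test")
{code}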
[jira] [Updated] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files
[ https://issues.apache.org/jira/browse/SPARK-31662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31662: --- Description: Write dates with dictionary encoding enabled to parquet files: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> :paste // Entering paste mode (ctrl-D to finish) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")) .repartition(1) .write .option("parquet.enable.dictionary", true) .mode("overwrite") .parquet("/Users/maximgekk/tmp/parquet-date-dict") // Exiting paste mode, now interpreting. {code} Read them back: {code:scala} scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false) +--+ |date | +--+ |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| +--+ {code} *Expected values must be 1001-01-01.* I checked that the date column is encoded by dictionary via: {code} ➜ parquet-date-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump ./part-0-84a77214-0c8c-45e9-ac41-5ca863b9dd94-c000.snappy.parquet row group 0 date: INT32 SNAPPY DO:0 FPO:4 SZ:74/70/0.95 VC:8 ENC:BIT_PACKED,RLE,P [more]... date TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:8 INT32 date *** row group 1 of 1, values 1 to 8 *** value 1: R:0 D:1 V:1001-01-07 value 2: R:0 D:1 V:1001-01-07 value 3: R:0 D:1 V:1001-01-07 value 4: R:0 D:1 V:1001-01-07 value 5: R:0 D:1 V:1001-01-07 value 6: R:0 D:1 V:1001-01-07 value 7: R:0 D:1 V:1001-01-07 value 8: R:0 D:1 V:1001-01-07 {code} was: Write dates with dictionary encoding enabled to parquet files: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> :paste // Entering paste mode (ctrl-D to finish) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")) .repartition(1) .write .option("parquet.enable.dictionary", true) .mode("overwrite") .parquet("/Users/maximgekk/tmp/parquet-date-dict") // Exiting paste mode, now interpreting. {code} Read them back: {code:scala} scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false) +--+ |date | +--+ |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| +--+ {code} *Expected values must be 1001-01-01.* > Reading wrong dates from dictionary encoded columns in Parquet files > > > Key: SPARK-31662 > URL: https://issues.apache.org/jira/browse/SPARK-31662 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates with dictionary encoding enabled to parquet files: > {code:scala} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated.
> Type :help for more information. > scala> > spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) > scala> :paste > // Entering paste mode (ctrl-D to finish) > Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") > .select($"dateS".cast("date").as("date")) > .repartition(1) > .write > .option("parquet.enable.dictionary", true) > .mode("overwrite") > .parquet("/Users/maximgekk/tmp/parquet-date-dict") > // Exiting paste mode, now interpreting. > {code} > Read them back: >
[jira] [Created] (SPARK-31662) Reading wrong dates from dictionary encoded columns in Parquet files
Maxim Gekk created SPARK-31662: -- Summary: Reading wrong dates from dictionary encoded columns in Parquet files Key: SPARK-31662 URL: https://issues.apache.org/jira/browse/SPARK-31662 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Write dates with dictionary encoding enabled to parquet files: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> :paste // Entering paste mode (ctrl-D to finish) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")) .repartition(1) .write .option("parquet.enable.dictionary", true) .mode("overwrite") .parquet("/Users/maximgekk/tmp/parquet-date-dict") // Exiting paste mode, now interpreting. {code} Read them back: {code:scala} scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false) +--+ |date | +--+ |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| |1001-01-07| +--+ {code} *Expected values must be 1001-01-01.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31641) Incorrect days conversion by JSON legacy parser
Maxim Gekk created SPARK-31641: -- Summary: Incorrect days conversion by JSON legacy parser Key: SPARK-31641 URL: https://issues.apache.org/jira/browse/SPARK-31641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Spark 2.4.5: {code:scala} scala> val ds = Seq("{'d': '-141704'}").toDS ds: org.apache.spark.sql.Dataset[String] = [value: string] scala> val json = spark.read.schema("d date").json(ds) json: org.apache.spark.sql.DataFrame = [d: date] scala> json.show +--+ | d| +--+ |1582-01-01| +--+ {code} Spark 3.1.0-SNAPSHOT: {code:scala} scala> val ds = Seq("{'d': '-141704'}").toDS ds: org.apache.spark.sql.Dataset[String] = [value: string] scala> val json = spark.read.schema("d date").json(ds) json: org.apache.spark.sql.DataFrame = [d: date] scala> json.show +--+ | d| +--+ |1582-01-11| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099579#comment-17099579 ] Maxim Gekk commented on SPARK-31579: [~suddhuASF] Replacing floorDiv by / is trivial. Please first write code which proves that, and post it here in a comment. /cc [~cloud_fan] [~hyukjin.kwon] The code should go over all available time zones with a step of 1 hour plus a jitter of a few minutes. > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31630) Skip timestamp rebasing after 1900-01-01
Maxim Gekk created SPARK-31630: -- Summary: Skip timestamp rebasing after 1900-01-01 Key: SPARK-31630 URL: https://issues.apache.org/jira/browse/SPARK-31630 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk The conversions of Catalyst's DATE/TIMESTAMP values to/from Java's types java.sql.Date/java.sql.Timestamp have almost the same implementation except for an additional rebasing op. If we look at the switch and diff arrays of all available time zones, we can detect that there is a time point after which all diffs are 0. This is 1900-01-01 00:00:00Z. So, we can compare input micros with that time point and skip the conversion for modern timestamps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
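A sketch of the proposed fast path (the constant and helper names below are illustrative, not Spark's actual internals):
{code:scala}
// 1900-01-01T00:00:00Z in microseconds since the epoch; per the description,
// all per-zone diffs are 0 at and after this point.
val lastSwitchTs: Long = -2208988800000000L

// Placeholder standing in for the full per-zone rebase logic.
def rebaseSlowPath(micros: Long): Long = micros

def rebaseMicros(micros: Long): Long =
  if (micros >= lastSwitchTs) micros  // modern timestamp: skip rebasing
  else rebaseSlowPath(micros)
{code}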
[jira] [Created] (SPARK-31623) Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write
Maxim Gekk created SPARK-31623: -- Summary: Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write Key: SPARK-31623 URL: https://issues.apache.org/jira/browse/SPARK-31623 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add benchmark cases to DateTimeRebaseBenchmark for: # Read/Write INT96 timestamps # Read/Write TIMESTAMP_MILLIS w/ rebasing on/off -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
[ https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-31554. Resolution: Not A Problem > Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite > > > Key: SPARK-31554 > URL: https://issues.apache.org/jira/browse/SPARK-31554 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, > for example: > * https://github.com/apache/spark/pull/28328#issuecomment-618992335 > The error message: > {code} > org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error > reporting > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with > error line 'Exception in thread "main" > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: > Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152) > at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188) > at > scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192) > at > org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30) > {code} > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/27617#issuecomment-614318644 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
Maxim Gekk created SPARK-31579: -- Summary: Replace floorDiv by / in localRebaseGregorianToJulianDays() Key: SPARK-31579 URL: https://issues.apache.org/jira/browse/SPARK-31579 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check that for all available time zones in the range of [0001, 2100] years with the step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv can be replaced by /, and this should improve performance of RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
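For context on why the hypothesis matters, a small plain-Scala illustration (no Spark internals): floorDiv and / disagree only for negative dividends with a non-zero remainder, so confirming the hypothesis would make the two interchangeable here.
{code:scala}
val MILLIS_PER_DAY = 86400000L

// floorDiv rounds toward negative infinity, / truncates toward zero;
// they disagree only for negative dividends with a non-zero remainder.
Math.floorDiv(-1L, MILLIS_PER_DAY)  // -1
-1L / MILLIS_PER_DAY                // 0

// If utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 always holds, the two
// agree even for negative values, so the cheaper / would be safe:
val millis = -3L * MILLIS_PER_DAY
assert(Math.floorDiv(millis, MILLIS_PER_DAY) == millis / MILLIS_PER_DAY)  // both -3
{code}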
[jira] [Commented] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092824#comment-17092824 ] Maxim Gekk commented on SPARK-31449: [~cloud_fan] [~hyukjin.kwon] I compared results of those 2 functions for all time zones with step of 1 day, and found many differences in results: {code:scala} test("Investigate the difference between JDK and Spark's time zone offset calculation") { import java.util.{Calendar, TimeZone} import sun.util.calendar.ZoneInfo def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = { var guess = tz.getRawOffset // the actual offset should be calculated based on milliseconds in UTC val offset = tz.getOffset(millisLocal - guess) if (offset != guess) { guess = tz.getOffset(millisLocal - offset) if (guess != offset) { // fallback to do the reverse lookup using java.sql.Timestamp // this should only happen near the start or end of DST val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt val year = getYear(days) val month = getMonth(days) val day = getDayOfMonth(days) var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt if (millisOfDay < 0) { millisOfDay += MILLIS_PER_DAY.toInt } val seconds = (millisOfDay / 1000L).toInt val hh = seconds / 3600 val mm = seconds / 60 % 60 val ss = seconds % 60 val ms = millisOfDay % 1000 val calendar = Calendar.getInstance(tz) calendar.set(year, month - 1, day, hh, mm, ss) calendar.set(Calendar.MILLISECOND, ms) guess = (millisLocal - calendar.getTimeInMillis()).toInt } } guess } def getOffsetFromLocalMillis2(millisLocal: Long, tz: TimeZone): Long = { tz match { case zoneInfo: ZoneInfo => zoneInfo.getOffsetsByWall(millisLocal, null) case timeZone: TimeZone => timeZone.getOffset(millisLocal - timeZone.getRawOffset) } } ALL_TIMEZONES .sortBy(_.getId) .foreach { zid => withDefaultTimeZone(zid) { val start = microsToMillis(instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(zid) .toInstant)) val end = microsToMillis(instantToMicros(LocalDateTime.of(2037, 1, 1, 0, 0, 0) .atZone(zid) .toInstant)) var millis = start var step: Long = MILLIS_PER_DAY while (millis < end) { val offset1 = getOffsetFromLocalMillis(millis, TimeZone.getTimeZone(zid)) val offset2 = getOffsetFromLocalMillis2(millis, TimeZone.getTimeZone(zid)) if (offset1 != offset2) { println(s"${zid.getId} ${new Timestamp(millis)} $offset1 $offset2") } millis += step } } } } {code} {code} Africa/Algiers 1916-10-01 23:47:48.0 360 0 Africa/Algiers 1917-10-07 23:47:48.0 360 0 Africa/Algiers 1918-10-06 23:47:48.0 360 0 Africa/Algiers 1919-10-05 23:47:48.0 360 0 Africa/Algiers 1920-10-23 23:47:48.0 360 0 Africa/Algiers 1921-06-21 23:47:48.0 360 0 Africa/Algiers 1946-10-06 23:47:48.0 360 0 Africa/Algiers 1963-04-13 23:47:48.0 360 0 Africa/Algiers 1971-09-26 23:47:48.0 360 0 Africa/Algiers 1979-10-25 23:47:48.0 360 0 Africa/Ceuta 1900-01-01 00:00:00.0 360 -1276000 Africa/Ceuta 1924-10-05 00:21:16.0 360 0 Africa/Ceuta 1926-10-03 00:21:16.0 360 0 Africa/Ceuta 1927-10-02 00:21:16.0 360 0 Africa/Ceuta 1928-10-07 00:21:16.0 360 0 Africa/Sao_Tome 1899-12-31 23:33:04.0 0 -2205000 Africa/Tripoli 1952-01-01 00:07:16.0 720 360 Africa/Tripoli 1954-01-01 00:07:16.0 720 360 Africa/Tripoli 1956-01-01 00:07:16.0 720 360 Africa/Tripoli 1982-01-01 00:07:16.0 720 360 Africa/Tripoli 1982-10-01 00:07:16.0 720 360 Africa/Tripoli 1983-10-01 00:07:16.0 720 360 Africa/Tripoli 1984-10-01 00:07:16.0 720 360 Africa/Tripoli 1985-10-01 00:07:16.0 720 360 Africa/Tripoli 1986-10-03 00:07:16.0 720 360 Africa/Tripoli 
1987-10-01 00:07:16.0 720 360 Africa/Tripoli 1988-10-01 00:07:16.0 720 360 Africa/Tripoli 1989-10-01 00:07:16.0 720 360 Africa/Tripoli 1996-09-30 00:07:16.0 720 360 America/Inuvik 1965-10-30 18:00:00.0 -2160 -2880 America/Iqaluit 1999-10-30 20:00:00.0 -1440 -2160 America/Pangnirtung 1999-10-30 20:00:00.0 -1440 -2160 Antarctica/Casey 1900-01-01 00:00:00.0 2880 0 Antarctica/Davis 1900-01-01 00:00:00.0 2520 0 Antarctica/Davis 2009-10-18 05:00:00.0 2520 1800 Antarctica/Davis 2011-10-28 05:00:00.0 2520 1800 Antarctica/DumontDUrville 1900-01-01 00:00:00.0 3600 0 Antarctica/Mawson 1900-01-01 00:00:00.0 1800 0 Antarctica/Syowa 1900-01-01 00:00:0
[jira] [Commented] (SPARK-31563) Failure of InSet.sql for UTF8String collection
[ https://issues.apache.org/jira/browse/SPARK-31563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092168#comment-17092168 ] Maxim Gekk commented on SPARK-31563: I am working on the issue > Failure of InSet.sql for UTF8String collection > -- > > Key: SPARK-31563 > URL: https://issues.apache.org/jira/browse/SPARK-31563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The InSet expression works on collections of internal Catalyst's types. We > can see this in the optimization when In is replaced by InSet, and In's > collection is evaluated to internal Catalyst's values: > [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L254] > {code:scala} > if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) { > val hSet = newList.map(e => e.eval(EmptyRow)) > InSet(v, HashSet() ++ hSet) > } > {code} > The code existed before the optimization > https://github.com/apache/spark/pull/25754 that made another wrong assumption > about collection types. > If InSet accepts only internal Catalyst's types, the following code shouldn't > fail: > {code:scala} > InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString)).sql > {code} > but it fails with the exception: > {code} > Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a > java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.unsafe.types.UTF8String a > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:88) > at > org.apache.spark.sql.catalyst.expressions.InSet.$anonfun$sql$2(predicates.scala:522) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31563) Failure of InSet.sql for UTF8String collection
Maxim Gekk created SPARK-31563: -- Summary: Failure of InSet.sql for UTF8String collection Key: SPARK-31563 URL: https://issues.apache.org/jira/browse/SPARK-31563 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 3.0.0, 3.1.0 Reporter: Maxim Gekk The InSet expression works on collections of internal Catalyst's types. We can see this in the optimization when In is replaced by InSet, and In's collection is evaluated to internal Catalyst's values: [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L254] {code:scala} if (newList.length > SQLConf.get.optimizerInSetConversionThreshold) { val hSet = newList.map(e => e.eval(EmptyRow)) InSet(v, HashSet() ++ hSet) } {code} The code existed before the optimization https://github.com/apache/spark/pull/25754 that made another wrong assumption about collection types. If InSet accepts only internal Catalyst's types, the following code shouldn't fail: {code:scala} InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString)).sql {code} but it fails with the exception: {code} Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a java.lang.RuntimeException: Unsupported literal type class org.apache.spark.unsafe.types.UTF8String a at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:88) at org.apache.spark.sql.catalyst.expressions.InSet.$anonfun$sql$2(predicates.scala:522) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
[ https://issues.apache.org/jira/browse/SPARK-31554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091614#comment-17091614 ] Maxim Gekk commented on SPARK-31554: [~cloud_fan] [~hyukjin.kwon] Can we disable the flaky test till someone makes it stable? > Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite > > > Key: SPARK-31554 > URL: https://issues.apache.org/jira/browse/SPARK-31554 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, > for example: > * https://github.com/apache/spark/pull/28328#issuecomment-618992335 > The error message: > {code} > org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error > reporting > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with > error line 'Exception in thread "main" > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: > Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152) > at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152) > at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188) > at > scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192) > at > org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30) > {code} > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/28261#issuecomment-618950225 > * https://github.com/apache/spark/pull/27617#issuecomment-614318644 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31554) Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite
Maxim Gekk created SPARK-31554: -- Summary: Flaky test suite org.apache.spark.sql.hive.thriftserver.CliSuite Key: SPARK-31554 URL: https://issues.apache.org/jira/browse/SPARK-31554 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The test org.apache.spark.sql.hive.thriftserver.CliSuite fails very often, for example: * https://github.com/apache/spark/pull/28328#issuecomment-618992335 The error message: {code} org.apache.spark.sql.hive.thriftserver.CliSuite.SPARK-11188 Analysis error reporting Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Failed with error line 'Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$4(CliSuite.scala:138) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.CliSuite.captureOutput$1(CliSuite.scala:135) at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6(CliSuite.scala:152) at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$runCliWithin$6$adapted(CliSuite.scala:152) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:188) at scala.sys.process.BasicIO$.$anonfun$processFully$1$adapted(BasicIO.scala:192) at org.apache.spark.sql.test.ProcessTestUtils$ProcessOutputCapturer.run(ProcessTestUtils.scala:30) {code} * https://github.com/apache/spark/pull/28261#issuecomment-618950225 * https://github.com/apache/spark/pull/28261#issuecomment-618950225 * https://github.com/apache/spark/pull/27617#issuecomment-614318644 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections
[ https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091490#comment-17091490 ] Maxim Gekk commented on SPARK-31553: I am working on the issue > Wrong result of isInCollection for large collections > > > Key: SPARK-31553 > URL: https://issues.apache.org/jira/browse/SPARK-31553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > If the size of a collection passed to isInCollection is bigger than > spark.sql.optimizer.inSetConversionThreshold, the method can return wrong > results for some inputs. For example: > {code:scala} > val set = (0 to 20).map(_.toString).toSet > val data = Seq("1").toDF("x") > println(set.contains("1")) > data.select($"x".isInCollection(set).as("isInCollection")).show() > {code} > {code} > true > +--+ > |isInCollection| > +--+ > | false| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31553) Wrong result of isInCollection for large collections
Maxim Gekk created SPARK-31553: -- Summary: Wrong result of isInCollection for large collections Key: SPARK-31553 URL: https://issues.apache.org/jira/browse/SPARK-31553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk If the size of a collection passed to isInCollection is bigger than spark.sql.optimizer.inSetConversionThreshold, the method can return wrong results for some inputs. For example: {code:scala} val set = (0 to 20).map(_.toString).toSet val data = Seq("1").toDF("x") println(set.contains("1")) data.select($"x".isInCollection(set).as("isInCollection")).show() {code} {code} true +--+ |isInCollection| +--+ | false| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091389#comment-17091389 ] Maxim Gekk commented on SPARK-31463: Parsing itself takes 10-20% of the time. The JSON datasource spends significant time in conversions to the desired types according to the schema. Even if you improve the performance of parsing by a few times, the total impact will not be that significant. > Enhance JsonDataSource by replacing jackson with simdjson > - > > Key: SPARK-31463 > URL: https://issues.apache.org/jira/browse/SPARK-31463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Steven Moy >Priority: Minor > > I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how > to improve json reading speed. We use Spark to process terabytes of JSON, so > we try to find ways to improve JSON parsing speed. > > [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/] > > [https://github.com/simdjson/simdjson/issues/93] > > Anyone in the open-source community interested in leading this effort to > integrate simdjson in the Spark JSON data source API? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31449: --- Summary: Investigate the difference between JDK and Spark's time zone offset calculation (was: Is there a difference between JDK and Spark's time zone offset calculation) > Investigate the difference between JDK and Spark's time zone offset > calculation > --- > > Key: SPARK-31449 > URL: https://issues.apache.org/jira/browse/SPARK-31449 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > Spark 2.4 calculates time zone offsets from wall clock timestamp using > `DateTimeUtils.getOffsetFromLocalMillis()` (see > https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118): > {code:scala} > private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): > Long = { > var guess = tz.getRawOffset > // the actual offset should be calculated based on milliseconds in UTC > val offset = tz.getOffset(millisLocal - guess) > if (offset != guess) { > guess = tz.getOffset(millisLocal - offset) > if (guess != offset) { > // fallback to do the reverse lookup using java.sql.Timestamp > // this should only happen near the start or end of DST > val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > val year = getYear(days) > val month = getMonth(days) > val day = getDayOfMonth(days) > var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt > if (millisOfDay < 0) { > millisOfDay += MILLIS_PER_DAY.toInt > } > val seconds = (millisOfDay / 1000L).toInt > val hh = seconds / 3600 > val mm = seconds / 60 % 60 > val ss = seconds % 60 > val ms = millisOfDay % 1000 > val calendar = Calendar.getInstance(tz) > calendar.set(year, month - 1, day, hh, mm, ss) > calendar.set(Calendar.MILLISECOND, ms) > guess = (millisLocal - calendar.getTimeInMillis()).toInt > } > } > guess > } > {code} > Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see > https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: > {code:java} > if (zone instanceof ZoneInfo) { > ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); > } else { > int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? > internalGet(ZONE_OFFSET) : > zone.getRawOffset(); > zone.getOffsets(millis - gmtOffset, zoneOffsets); > } > {code} > We need to investigate whether there are any differences in results between the 2 approaches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31449: --- Issue Type: Improvement (was: Question) > Investigate the difference between JDK and Spark's time zone offset > calculation > --- > > Key: SPARK-31449 > URL: https://issues.apache.org/jira/browse/SPARK-31449 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > Spark 2.4 calculates time zone offsets from wall clock timestamp using > `DateTimeUtils.getOffsetFromLocalMillis()` (see > https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118): > {code:scala} > private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): > Long = { > var guess = tz.getRawOffset > // the actual offset should be calculated based on milliseconds in UTC > val offset = tz.getOffset(millisLocal - guess) > if (offset != guess) { > guess = tz.getOffset(millisLocal - offset) > if (guess != offset) { > // fallback to do the reverse lookup using java.sql.Timestamp > // this should only happen near the start or end of DST > val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > val year = getYear(days) > val month = getMonth(days) > val day = getDayOfMonth(days) > var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt > if (millisOfDay < 0) { > millisOfDay += MILLIS_PER_DAY.toInt > } > val seconds = (millisOfDay / 1000L).toInt > val hh = seconds / 3600 > val mm = seconds / 60 % 60 > val ss = seconds % 60 > val ms = millisOfDay % 1000 > val calendar = Calendar.getInstance(tz) > calendar.set(year, month - 1, day, hh, mm, ss) > calendar.set(Calendar.MILLISECOND, ms) > guess = (millisLocal - calendar.getTimeInMillis()).toInt > } > } > guess > } > {code} > Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see > https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: > {code:java} > if (zone instanceof ZoneInfo) { > ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); > } else { > int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? > internalGet(ZONE_OFFSET) : > zone.getRawOffset(); > zone.getOffsets(millis - gmtOffset, zoneOffsets); > } > {code} > We need to investigate whether there are any differences in results between the 2 approaches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31490) Benchmark conversions to/from Java 8 date-time types
Maxim Gekk created SPARK-31490: -- Summary: Benchmark conversions to/from Java 8 date-time types Key: SPARK-31490 URL: https://issues.apache.org/jira/browse/SPARK-31490 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk DATE and TIMESTAMP column values can be converted to java.sql.Date and java.sql.Timestamp (by default), or to the Java 8 date-time types java.time.LocalDate and java.time.Instant when spark.sql.datetime.java8API.enabled is set to true. DateTimeBenchmark misses benchmarks of Java 8 dates/timestamps. This ticket aims to fix that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
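For reference, the behavior the new benchmark cases would exercise — a small spark-shell example of the config toggle (the literal values are illustrative): with the Java 8 API enabled, collected DATE/TIMESTAMP values come back as java.time.LocalDate/java.time.Instant instead of the java.sql types.
{code:scala}
spark.conf.set("spark.sql.datetime.java8API.enabled", true)
val row = spark.sql("SELECT DATE '2020-04-17' AS d, TIMESTAMP '2020-04-17 00:00:00' AS ts").head
val d: java.time.LocalDate = row.getAs[java.time.LocalDate]("d")
val ts: java.time.Instant = row.getAs[java.time.Instant]("ts")
{code}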
[jira] [Created] (SPARK-31489) Failure on pushing down filters with java.time.LocalDate values in ORC
Maxim Gekk created SPARK-31489: -- Summary: Failure on pushing down filters with java.time.LocalDate values in ORC Key: SPARK-31489 URL: https://issues.apache.org/jira/browse/SPARK-31489 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.0.1 Reporter: Maxim Gekk When spark.sql.datetime.java8API.enabled is set to true, filters pushed down to the ORC datasource with java.time.LocalDate values fail with the exception: {code} Wrong value class java.time.LocalDate for DATE.EQUALS leaf java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352) at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-31488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31488: --- Description: Currently, ParquetFilters supports only java.sql.Date values of DateType, and explicitly casts Any to java.sql.Date, see https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 So, any filters that refer to date values are not pushed down to Parquet when spark.sql.datetime.java8API.enabled is true. was: Currently, ParquetFilters supports only java.sql.Date values of DateType, and explicitly casts Any to java.sql.Date, see https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 The code fails with an exception when spark.sql.datetime.java8API.enabled is true. > Support `java.time.LocalDate` in Parquet filter pushdown > > > Key: SPARK-31488 > URL: https://issues.apache.org/jira/browse/SPARK-31488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, ParquetFilters supports only java.sql.Date values of DateType, and > explicitly casts Any to java.sql.Date, see > https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 > So, any filters that refer to date values are not pushed down to Parquet when > spark.sql.datetime.java8API.enabled is true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31488) Support `java.time.LocalDate` in Parquet filter pushdown
Maxim Gekk created SPARK-31488: -- Summary: Support `java.time.LocalDate` in Parquet filter pushdown Key: SPARK-31488 URL: https://issues.apache.org/jira/browse/SPARK-31488 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Maxim Gekk Currently, ParquetFilters supports only java.sql.Date values of DateType, and explicitly casts Any to java.sql.Date, see https://github.com/apache/spark/blob/cb0db213736de5c5c02b09a2d5c3e17254708ce1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L176 The code fails with an exception when spark.sql.datetime.java8API.enabled is true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
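A sketch of the general shape of a fix (the helper below is hypothetical and deliberately simplified — it ignores the session time zone and calendar rebasing that Spark's real conversions perform): accept both value classes that a DateType filter value can arrive as.
{code:scala}
import java.time.LocalDate

// Convert a pushed-down date filter value to days since the epoch,
// whichever of the two possible classes it arrives as.
def dateToDays(value: Any): Int = value match {
  case d: java.sql.Date => Math.floorDiv(d.getTime, 86400000L).toInt // simplified: no time zone handling
  case ld: LocalDate    => ld.toEpochDay.toInt
}
{code}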
[jira] [Created] (SPARK-31471) Add a script to run multiple benchmarks
Maxim Gekk created SPARK-31471: -- Summary: Add a script to run multiple benchmarks Key: SPARK-31471 URL: https://issues.apache.org/jira/browse/SPARK-31471 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Add a python script to run multiple benchmarks. The script can be taken from [https://github.com/apache/spark/pull/27078] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084308#comment-17084308 ] Maxim Gekk commented on SPARK-31423: [~bersprockets] I think we should take the next valid date for any non-existent dates, see the linked PR. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31449) Is there a difference between JDK and Spark's time zone offset calculation
Maxim Gekk created SPARK-31449: -- Summary: Is there a difference between JDK and Spark's time zone offset calculation Key: SPARK-31449 URL: https://issues.apache.org/jira/browse/SPARK-31449 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.4.5 Reporter: Maxim Gekk Spark 2.4 calculates time zone offsets from wall clock timestamp using `DateTimeUtils.getOffsetFromLocalMillis()` (see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118): {code:scala} private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = { var guess = tz.getRawOffset // the actual offset should be calculated based on milliseconds in UTC val offset = tz.getOffset(millisLocal - guess) if (offset != guess) { guess = tz.getOffset(millisLocal - offset) if (guess != offset) { // fallback to do the reverse lookup using java.sql.Timestamp // this should only happen near the start or end of DST val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt val year = getYear(days) val month = getMonth(days) val day = getDayOfMonth(days) var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt if (millisOfDay < 0) { millisOfDay += MILLIS_PER_DAY.toInt } val seconds = (millisOfDay / 1000L).toInt val hh = seconds / 3600 val mm = seconds / 60 % 60 val ss = seconds % 60 val ms = millisOfDay % 1000 val calendar = Calendar.getInstance(tz) calendar.set(year, month - 1, day, hh, mm, ss) calendar.set(Calendar.MILLISECOND, ms) guess = (millisLocal - calendar.getTimeInMillis()).toInt } } guess } {code} Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: {code:java} if (zone instanceof ZoneInfo) { ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); } else { int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? internalGet(ZONE_OFFSET) : zone.getRawOffset(); zone.getOffsets(millis - gmtOffset, zoneOffsets); } {code} We need to investigate whether there are any differences in results between the 2 approaches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083595#comment-17083595 ] Maxim Gekk commented on SPARK-31423: I have debugged this a bit on Spark 2.4: '1582-10-14' falls into this case while parsing from UTF8String: https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2762-L2768 {code:java} // The date is in a "missing" period. if (!isLenient()) { throw new IllegalArgumentException("the specified date doesn't exist"); } // Take the Julian date for compatibility, which // will produce a Gregorian date. fixedDate = jfd; {code} In the strict (non-lenient) mode, the code https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L517 would throw the exception: {code} throw new IllegalArgumentException("the specified date doesn't exist") {code} but we are in the lenient mode, in which Java 7's GregorianCalendar interprets the date specially: {code} // Take the Julian date for compatibility, which // will produce a Gregorian date. {code} The date '1582-10-14' doesn't exist in the hybrid calendar used by the Java 7 time API. It is questionable how to handle the date in such a calendar. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change.
> {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should
[jira] [Resolved] (SPARK-31445) Avoid floating-point division in millisToDays
[ https://issues.apache.org/jira/browse/SPARK-31445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-31445. Resolution: Won't Fix > Avoid floating-point division in millisToDays > - > > Key: SPARK-31445 > URL: https://issues.apache.org/jira/browse/SPARK-31445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Minor > > The benchmark https://github.com/MaxGekk/spark/pull/27, and a comparison to > Spark 3.0 plus an optimisation of fromJavaDate in > https://github.com/apache/spark/pull/28205, show that floating-point ops in > millisToDays badly impact the performance of converting java.sql.Date to > Catalyst's date values. The ticket aims to replace the double ops by int/long ops. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31445) Avoid floating-point division in millisToDays
Maxim Gekk created SPARK-31445: -- Summary: Avoid floating-point division in millisToDays Key: SPARK-31445 URL: https://issues.apache.org/jira/browse/SPARK-31445 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Reporter: Maxim Gekk The benchmark https://github.com/MaxGekk/spark/pull/27, and a comparison to Spark 3.0 plus an optimisation of fromJavaDate in https://github.com/apache/spark/pull/28205, show that floating-point ops in millisToDays badly impact the performance of converting java.sql.Date to Catalyst's date values. The ticket aims to replace the double ops by int/long ops. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
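The proposed replacement, in spirit — a standalone sketch that leaves out the time zone offset handling of the real millisToDays:
{code:scala}
val MILLIS_PER_DAY = 86400000L

// Current shape: floating-point floor over a double division.
def millisToDaysViaDouble(millis: Long): Int =
  Math.floor(millis.toDouble / MILLIS_PER_DAY).toInt

// Proposed shape: integer floor division, no double arithmetic.
def millisToDaysViaLong(millis: Long): Int =
  Math.floorDiv(millis, MILLIS_PER_DAY).toInt

assert(millisToDaysViaDouble(-1L) == millisToDaysViaLong(-1L))  // both -1
{code}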
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083314#comment-17083314 ] Maxim Gekk commented on SPARK-31423: I am working on the issue. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
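As background for the issue quoted above: the gap 1582-10-05..1582-10-14 exists only in the hybrid Julian+Gregorian calendar. A hedged illustration on a plain JVM; the java.sql.Date output is expected, by the lenient hybrid-calendar normalization the issue describes, to show the same 10-day shift:
{code:scala}
import java.sql.Date
import java.time.LocalDate

// Hybrid calendar (java.sql.Date / java.util.GregorianCalendar): the dates
// 1582-10-05..14 don't exist, and lenient normalization rolls them forward.
println(Date.valueOf("1582-10-14"))    // expected: 1582-10-24
// Proleptic Gregorian calendar (java.time): the date is valid as-is.
println(LocalDate.parse("1582-10-14")) // 1582-10-14
{code}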
[jira] [Comment Edited] (SPARK-31443) Perf regression of toJavaDate
[ https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083217#comment-17083217 ] Maxim Gekk edited comment on SPARK-31443 at 4/14/20, 1:21 PM: -- FYI [~cloud_fan] I got the numbers on the master without https://github.com/apache/spark/pull/28205 was (Author: maxgekk): FYI [~cloud_fan] > Perf regression of toJavaDate > - > > Key: SPARK-31443 > URL: https://issues.apache.org/jira/browse/SPARK-31443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > DateTimeBenchmark shows the regression > Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 559603 > 38 8.9 111.8 1.0X > Collect dates 2306 3221 > 1558 2.2 461.1 0.2X > {code} > Current master: > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 1052 1130 > 73 4.8 210.3 1.0X > Collect dates 3251 4943 > 1624 1.5 650.2 0.3X > {code} > If we subtract preparing DATE column: > * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row > * master is (650.2 - 210.3) = 439 ns/row > The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - > 349.3)/349.3 = 25% -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31443) Perf regression of toJavaDate
[ https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083217#comment-17083217 ] Maxim Gekk commented on SPARK-31443: FYI [~cloud_fan] > Perf regression of toJavaDate > - > > Key: SPARK-31443 > URL: https://issues.apache.org/jira/browse/SPARK-31443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > DateTimeBenchmark shows the regression > Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 559603 > 38 8.9 111.8 1.0X > Collect dates 2306 3221 > 1558 2.2 461.1 0.2X > {code} > Current master: > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 1052 1130 > 73 4.8 210.3 1.0X > Collect dates 3251 4943 > 1624 1.5 650.2 0.3X > {code} > If we subtract preparing DATE column: > * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row > * master is (650.2 - 210.3) = 439 ns/row > The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - > 349.3)/349.3 = 25% -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31443) Perf regression of toJavaDate
[ https://issues.apache.org/jira/browse/SPARK-31443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31443: --- Description: DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] {code:java} OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 559603 > 38 8.9 111.8 1.0X Collect dates 2306 3221 1558 2.2 461.1 0.2X {code} Current master: {code:java} OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 1052 1130 > 73 4.8 210.3 1.0X Collect dates 3251 4943 1624 1.5 650.2 0.3X {code} If we subtract preparing DATE column: * Spark 2.4.6-SNAPSHOT is (461.1 - 111.8) = 349.3 ns/row * master is (650.2 - 210.3) = 439 ns/row The regression of toJavaDate in master against Spark 2.4.6-SNAPSHOT is (439 - 349.3)/349.3 = 25% was: DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27 {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 614655 > 43 8.1 122.8 1.0X {code} Current master: {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative >From java.sql.Date 1154 1206 > 46 4.3 230.9 1.0X {code} The regression is ~x2. > Perf regression of toJavaDate > - > > Key: SPARK-31443 > URL: https://issues.apache.org/jira/browse/SPARK-31443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > DateTimeBenchmark shows the regression > Spark 2.4.6-SNAPSHOT at the PR [https://github.com/MaxGekk/spark/pull/27] > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > From java.sql.Date 559603 > 38 8.9 111.8 1.0X > Collect dates 2306 3221 > 1558 2.2 461.1 0.2X > {code} > Current master: > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux > 4.15.0-1063-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > To/from Java's date-time: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > --
[jira] [Created] (SPARK-31443) Perf regression of toJavaDate
Maxim Gekk created SPARK-31443: -- Summary: Perf regression of toJavaDate Key: SPARK-31443 URL: https://issues.apache.org/jira/browse/SPARK-31443 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27 {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 614 655 43 8.1 122.8 1.0X {code} Current master: {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 1154 1206 46 4.3 230.9 1.0X {code} The regression is ~x2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31439) Perf regression of fromJavaDate
Maxim Gekk created SPARK-31439: -- Summary: Perf regression of fromJavaDate Key: SPARK-31439 URL: https://issues.apache.org/jira/browse/SPARK-31439 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk DateTimeBenchmark shows the regression Spark 2.4.6-SNAPSHOT at the PR https://github.com/MaxGekk/spark/pull/27 {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 614 655 43 8.1 122.8 1.0X {code} Current master: {code} Conversion from/to external types OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative From java.sql.Date 1154 1206 46 4.3 230.9 1.0X {code} The regression is ~x2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
[ https://issues.apache.org/jira/browse/SPARK-31426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31426: --- Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Regression in loading/saving timestamps from/to ORC files > - > > Key: SPARK-31426 > URL: https://issues.apache.org/jira/browse/SPARK-31426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Here are results of DateTimeRebaseBenchmark on the current master branch: > {code} > Save timestamps to ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 158259877 59877 >0 1.7 598.8 0.0X > before 1582 61361 61361 >0 1.6 613.6 0.0X > Load timestamps from ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 1582, vec off 48197 48288 > 118 2.1 482.0 1.0X > after 1582, vec on38247 38351 > 128 2.6 382.5 1.3X > before 1582, vec off 53179 53359 > 249 1.9 531.8 0.9X > before 1582, vec on 44076 44268 > 269 2.3 440.8 1.1X > {code} > The results of the same benchmark on Spark 2.4.6-SNAPSHOT: > {code} > Save timestamps to ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 158218858 18858 >0 5.3 188.6 1.0X > before 1582 18508 18508 >0 5.4 185.1 1.0X > Load timestamps from ORC: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > after 1582, vec off 14063 14177 > 143 7.1 140.6 1.0X > after 1582, vec on 5955 6029 > 100 16.8 59.5 2.4X > before 1582, vec off 14119 14126 >7 7.1 141.2 1.0X > before 1582, vec on5991 6007 > 25 16.7 59.9 2.3X > {code} > Here is the PR with DateTimeRebaseBenchmark backported to 2.4: > https://github.com/MaxGekk/spark/pull/27 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31423: --- Comment: was deleted (was: This is intentional behavior because ORC format assumes the hybrid calendar (Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian calendar. See https://issues.apache.org/jira/browse/SPARK-30951) > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082051#comment-17082051 ] Maxim Gekk commented on SPARK-31423: This is intentional behavior because ORC format assumes the hybrid calendar (Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian calendar. See https://issues.apache.org/jira/browse/SPARK-30951 > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. > For example: > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.show // seems fine > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date") > scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > ORC has the same issue with TIMESTAMPS: > {noformat} > scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts") > df: org.apache.spark.sql.DataFrame = [ts: timestamp] > scala> df.show // seems fine > +---+ > | ts| > +---+ > |1582-10-14 00:00:00| > +---+ > scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp") > scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off > by 10 days > +---+ > |ts | > +---+ > |1582-10-24 00:00:00| > +---+ > scala> > {noformat} > However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range > do not change. > {noformat} > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date") > scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects > original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> val df = sql("select cast('1582-10-14' as DATE) dt") > df: org.apache.spark.sql.DataFrame = [dt: date] > scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date") > scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // > reflects original value > +--+ > |dt| > +--+ > |1582-10-14| > +--+ > scala> > {noformat} > It's unclear to me whether ORC is behaving correctly or not, as this is how > Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x > works with DATEs and TIMESTAMPs in general when > {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). 
In Spark 2.4, > DATEs and TIMESTAMPs in this range don't exist: > {noformat} > scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done > in Spark 2.4 > +--+ > |dt| > +--+ > |1582-10-24| > +--+ > scala> > {noformat} > I assume the following snippet is relevant (from the Wikipedia entry on the > Gregorian calendar): > {quote}To deal with the 10 days' difference (between calendar and > reality)[Note 2] that this drift had already reached, the date was advanced > so that 4 October 1582 was followed by 15 October 1582 > {quote} > Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and > probably based on spark.sql.legacy.timeParserPolicy (or some other config) > rather than file format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files
Maxim Gekk created SPARK-31426: -- Summary: Regression in loading/saving timestamps from/to ORC files Key: SPARK-31426 URL: https://issues.apache.org/jira/browse/SPARK-31426 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Here are results of DateTimeRebaseBenchmark on the current master branch: {code} Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582 59877 59877 0 1.7 598.8 0.0X before 1582 61361 61361 0 1.6 613.6 0.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582, vec off 48197 48288 118 2.1 482.0 1.0X after 1582, vec on 38247 38351 128 2.6 382.5 1.3X before 1582, vec off 53179 53359 249 1.9 531.8 0.9X before 1582, vec on 44076 44268 269 2.3 440.8 1.1X {code} The results of the same benchmark on Spark 2.4.6-SNAPSHOT: {code} Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582 18858 18858 0 5.3 188.6 1.0X before 1582 18508 18508 0 5.4 185.1 1.0X Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative after 1582, vec off 14063 14177 143 7.1 140.6 1.0X after 1582, vec on 5955 6029 100 16.8 59.5 2.4X before 1582, vec off 14119 14126 7 7.1 141.2 1.0X before 1582, vec on 5991 6007 25 16.7 59.9 2.3X {code} Here is the PR with DateTimeRebaseBenchmark backported to 2.4: https://github.com/MaxGekk/spark/pull/27 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28624) make_date is inconsistent when reading from table
[ https://issues.apache.org/jira/browse/SPARK-28624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080312#comment-17080312 ] Maxim Gekk commented on SPARK-28624: toJavaDate is implemented differently in the master [https://github.com/apache/spark/blob/e2d9399602d485eae94cd530d134ebab336e9e9b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L129-L132] > make_date is inconsistent when reading from table > - > > Key: SPARK-28624 > URL: https://issues.apache.org/jira/browse/SPARK-28624 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: Screen Shot 2019-08-05 at 18.19.39.png, collect > make_date.png > > > {code:sql} > spark-sql> create table test_make_date as select make_date(-44, 3, 15) as d; > spark-sql> select d, make_date(-44, 3, 15) from test_make_date; > 0045-03-15-0044-03-15 > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31402) Incorrect rebasing of BCE dates
Maxim Gekk created SPARK-31402: -- Summary: Incorrect rebasing of BCE dates Key: SPARK-31402 URL: https://issues.apache.org/jira/browse/SPARK-31402 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Dates before the common era are rebased incorrectly, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/ {code} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: postgreSQL/date.sql Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for query #93 select make_date(-44, 3, 15) {code} Even though such dates are out of the valid range of dates supported by the DATE type, there is a test in postgreSQL/date.sql for a negative year, so it would be nice to fix the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
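A compact repro of the failing query from the report above; the expected/actual values are taken from the Jenkins failure, not re-derived:
{code:scala}
// BCE rebasing bug: year -44 comes back as 45 CE
// (spark is an existing spark-shell SparkSession).
spark.sql("select make_date(-44, 3, 15)").show()
// Expected: -0044-03-15
// Actual (per the test failure): 0045-03-15
{code}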
[jira] [Created] (SPARK-31398) Speed up reading dates in ORC
Maxim Gekk created SPARK-31398: -- Summary: Speed up reading dates in ORC Key: SPARK-31398 URL: https://issues.apache.org/jira/browse/SPARK-31398 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the ORC datasource converts values of the DATE type to java.sql.Date, and then converts the result to days since the epoch in the Proleptic Gregorian calendar. The ORC datasource does this conversion when spark.sql.orc.enableVectorizedReader is set to false. The conversion to java.sql.Date is not necessary because we can use DaysWritable, which performs the rebasing in a much more optimal way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
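A hedged sketch of the proposed shortcut, assuming Hive's DateWritable exposes the hybrid-calendar day count via getDays() and that Spark's day-rebasing helper is available under the name used below (illustrative, not the actual patch):
{code:scala}
import org.apache.hadoop.hive.serde2.io.DateWritable
import org.apache.spark.sql.catalyst.util.RebaseDateTime.rebaseJulianToGregorianDays

// Skip the DateWritable -> java.sql.Date -> days round trip: take the
// Julian day count from the writable and rebase days-to-days directly.
def readOrcDateAsDays(writable: DateWritable): Int =
  rebaseJulianToGregorianDays(writable.getDays)
{code}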
[jira] [Created] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
Maxim Gekk created SPARK-31385: -- Summary: Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing Key: SPARK-31385 URL: https://issues.apache.org/jira/browse/SPARK-31385 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar is not symmetric to opposite conversion for the following time zones: # Asia/Tehran # Iran # Africa/Casablanca # Africa/El_Aaiun Here is the results from the https://github.com/apache/spark/pull/28119: Julian -> Gregorian: {code:json} , { "tz" : "Asia/Tehran", "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, -2208988800, 2547315000, 2547401400 ], "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] }, { "tz" : "Iran", "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, -2208988800, 2547315000, 2547401400 ], "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] }, { "tz" : "Africa/Casablanca", "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, 3699828000, 3702855600 ], "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600 ] }, { "tz" : "Africa/El_Aaiun", "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 3332714400, 333
[jira] [Created] (SPARK-31359) Speed up timestamps rebasing
Maxim Gekk created SPARK-31359: -- Summary: Speed up timestamps rebasing Key: SPARK-31359 URL: https://issues.apache.org/jira/browse/SPARK-31359 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, rebasing of timestamps is performed via conversions to local timestamps and back to microseconds. This is a CPU-intensive operation which can be avoided by converting via pre-calculated tables per time zone. For example, below are the timestamps at which the diff changes in the America/Los_Angeles time zone for the range 0001-01-01...2100-01-01: {code} 0001-01-01T00:00 diff = -2872 minutes 0100-03-01T00:00 diff = -1432 minutes 0200-03-01T00:00 diff = 7 minutes 0300-03-01T00:00 diff = 1447 minutes 0500-03-01T00:00 diff = 2887 minutes 0600-03-01T00:00 diff = 4327 minutes 0700-03-01T00:00 diff = 5767 minutes 0900-03-01T00:00 diff = 7207 minutes 1000-03-01T00:00 diff = 8647 minutes 1100-03-01T00:00 diff = 10087 minutes 1300-03-01T00:00 diff = 11527 minutes 1400-03-01T00:00 diff = 12967 minutes 1500-03-01T00:00 diff = 14407 minutes 1582-10-15T00:00 diff = 7 minutes 1883-11-18T12:22:58 diff = 0 minutes {code} It seems possible to build such rebasing maps and perform the rebasing via them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
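A hedged sketch of the lookup this ticket proposes, assuming pre-computed per-zone arrays of switch instants and diffs, both already converted to microseconds (the table above is in minutes), and assuming the input is not earlier than the first switch point:
{code:scala}
import java.util.Arrays

// switchMicros(i) is the instant from which diffMicros(i) applies;
// switchMicros is sorted ascending.
def rebaseViaTable(micros: Long, switchMicros: Array[Long], diffMicros: Array[Long]): Long = {
  var i = Arrays.binarySearch(switchMicros, micros)
  // On a miss, binarySearch returns -(insertionPoint) - 1; the last
  // switch point <= micros sits at insertionPoint - 1.
  if (i < 0) i = -i - 2
  micros + diffMicros(i)
}
{code}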
[jira] [Created] (SPARK-31353) Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark
Maxim Gekk created SPARK-31353: -- Summary: Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark Key: SPARK-31353 URL: https://issues.apache.org/jira/browse/SPARK-31353 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Performance of date-time functions can depend on the JVM system time zone or the SQL config spark.sql.session.timeZone. To avoid any fluctuations in benchmark results, the ticket aims to set a time zone explicitly in the date-time benchmarks DateTimeBenchmark and DateTimeRebaseBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
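What "setting a time zone explicitly" could look like; spark.sql.session.timeZone is the real config key, while placing these two lines in the benchmarks' setup is an assumption about the change:
{code:scala}
import java.util.TimeZone

// Pin both the JVM default zone and Spark's session time zone (spark is
// an existing SparkSession) so results don't depend on the host's locale.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}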
[jira] [Created] (SPARK-31343) Check codegen does not fail on expressions with special characters in string parameters
Maxim Gekk created SPARK-31343: -- Summary: Check codegen does not fail on expressions with special characters in string parameters Key: SPARK-31343 URL: https://issues.apache.org/jira/browse/SPARK-31343 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add tests similar to those added by the PR https://github.com/apache/spark/pull/20182 for from_utc_timestamp / to_utc_timestamp. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
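A sketch of the kind of test meant here, modeled on the linked PR; the exact expression and codegen helper names are assumptions about Spark's internal test utilities:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.expressions.{FromUTCTimestamp, Literal}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection

// Codegen must not emit uncompilable Java when a string parameter contains
// characters that need escaping, such as a double quote.
GenerateUnsafeProjection.generate(
  FromUTCTimestamp(
    Literal(Timestamp.valueOf("2015-07-24 00:00:00")), Literal("\"quote")) :: Nil)
{code}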
[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31328: --- Description: Run the following code in the *America/Los_Angeles* time zone: {code:scala} test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } {code} The rebased and original micros must be the same after 1883-11-18 because the standard zone offset and DST offset are the same in Proleptic Gregorian calendar and in the hybrid calendar (Julian+Gregorian) but actually there are differences of 60 minutes: {code:java} local date-time = 0001-01-01T00:00 diff = -2872 minutes local date-time = 0100-03-01T00:00 diff = -1432 minutes local date-time = 0200-03-01T00:00 diff = 7 minutes local date-time = 0300-03-01T00:00 diff = 1447 minutes local date-time = 0500-03-01T00:00 diff = 2887 minutes local date-time = 0600-03-01T00:00 diff = 4327 minutes local date-time = 0700-03-01T00:00 diff = 5767 minutes local date-time = 0900-03-01T00:00 diff = 7207 minutes local date-time = 1000-03-01T00:00 diff = 8647 minutes local date-time = 1100-03-01T00:00 diff = 10087 minutes local date-time = 1300-03-01T00:00 diff = 11527 minutes local date-time = 1400-03-01T00:00 diff = 12967 minutes local date-time = 1500-03-01T00:00 diff = 14407 minutes local date-time = 1582-10-15T00:00 diff = 7 minutes local date-time = 1883-11-18T12:22:58 diff = 0 minutes local date-time = 1918-10-27T01:22:58 diff = 60 minutes local date-time = 1918-10-27T01:22:58 diff = 0 minutes local date-time = 1919-10-26T01:22:58 diff = 60 minutes local date-time = 1919-10-26T01:22:58 diff = 0 minutes local date-time = 1945-09-30T01:22:58 diff = 60 minutes local date-time = 1945-09-30T01:22:58 diff = 0 minutes local date-time = 1949-01-01T01:22:58 diff = 60 minutes local date-time = 1949-01-01T01:22:58 diff = 0 minutes local date-time = 1950-09-24T01:22:58 diff = 60 minutes local date-time = 1950-09-24T01:22:58 diff = 0 minutes local date-time = 1951-09-30T01:22:58 diff = 60 minutes local date-time = 1951-09-30T01:22:58 diff = 0 minutes local date-time = 1952-09-28T01:22:58 diff = 60 minutes local date-time = 1952-09-28T01:22:58 diff = 0 minutes local date-time = 1953-09-27T01:22:58 diff = 60 minutes local date-time = 1953-09-27T01:22:58 diff = 0 minutes local date-time = 1954-09-26T01:22:58 diff = 60 minutes local date-time = 1954-09-26T01:22:58 diff = 0 minutes local date-time = 1955-09-25T01:22:58 diff = 60 minutes local date-time = 1955-09-25T01:22:58 diff = 0 minutes local date-time = 1956-09-30T01:22:58 diff = 60 minutes local date-time = 1956-09-30T01:22:58 diff = 0 minutes local date-time = 1957-09-29T01:22:58 diff = 60 minutes local date-time = 1957-09-29T01:22:58 diff = 0 minutes local date-time = 1958-09-28T01:22:58 diff = 60 minutes local date-time = 1958-09-28T01:22:58 diff 
= 0 minutes local date-time = 1959-09-27T01:22:58 diff = 60 minutes local date-time = 1959-09-27T01:22:58 diff = 0 minutes local date-time = 1960-09-25T01:22:58 diff = 60 minutes local date-time = 1960-09-25T01:22:58 diff = 0 minutes local date-time = 1961-09-24T01:22:58 diff = 60 minutes local date-time = 1961-09-24T01:22:58 diff = 0 minutes local date-time = 1962-10-28T01:22:58 diff = 60 minutes local date-time = 1962-10-28T01:22:58 diff = 0 minutes local date-time = 1963-10-27T01:22:58 diff = 60 minutes local date-time = 1963-10-27T01:22:58 diff = 0 minutes local date-time = 1964-10-25T01:22:58 diff = 60 minutes local date-time = 1964-10-25T01:22:58 diff = 0 minutes local date-time = 1965-10-31T01:22:58 diff = 60 minutes local date-time = 1965-10-31T01:22:58 diff = 0 minutes local date-time = 1966-10-30T01:22:58 diff = 60 minutes local date-time = 1966-10-30T01:22:58 diff = 0 minutes local date-time = 1967-10-29T01:22:58 diff = 60 minutes local date-time = 1967-10-29T01:22:58 diff = 0 minutes local date-time = 1968-10-27T01:22:58 diff = 60 minutes local date-time = 1968-10-27T01:22:58 diff = 0 minutes local date-time = 1969-10-26T01:22:58 diff = 60 minutes local date-time = 1969-10-26T01:22:58 diff = 0 minutes local date-time = 1970-10-25T01:22:58 di
[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31328: --- Description: Run the following code in the *America/Los_Angeles* time zone: {code:scala} test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time = 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time 
= 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local date-time = 1997-10-25T15:52:58 diff = 60 minutes local date-time = 1997-10-25T16:52:58 diff = 0 minutes local date-time = 1998-10-24T15:52:58 diff = 60 minutes local date-time = 1998-10-24T16:52:58 diff = 0 minutes local date-time = 1999-10-30T15:52:58 diff = 60 minutes local date-time = 1999-10-30T16:52:58 diff = 0 minutes local date-time = 2000-10-28T15:52:58 diff = 60 minutes local date-time
[jira] [Created] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
Maxim Gekk created SPARK-31328: -- Summary: Incorrect timestamps rebasing on autumn daylight saving time Key: SPARK-31328 URL: https://issues.apache.org/jira/browse/SPARK-31328 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 I do believe it is possible to speed up date-time rebasing by building a map of micros to diffs between original and rebased micros. And look up at the map via binary search. For example, the *America/Los_Angeles* time zone has less than 100 points when diff changes: {code:scala} test("optimize rebasing") { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += MICROS_PER_HOUR } println(s"counter = $counter") } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time 
= 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time = 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local dat
[jira] [Updated] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
[ https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31318: --- Parent: SPARK-30951 Issue Type: Sub-task (was: Improvement) > Split Parquet/Avro configs for rebasing dates/timestamps in read and in write > - > > Key: SPARK-31318 > URL: https://issues.apache.org/jira/browse/SPARK-31318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, Spark provides 2 SQL configs to control rebasing of > dates/timestamps in Parquet and Avro datasource: > spark.sql.legacy.parquet.rebaseDateTime.enabled > spark.sql.legacy.avro.rebaseDateTime.enabled > The configs control rebasing in read and in write. That's can be inconvenient > for users who want to read files saved by Spark 2.4 and earlier versions, and > save dates/timestamps without rebasing. > The ticket aims to split the configs, and introduce separate SQL configs for > read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
Maxim Gekk created SPARK-31318: -- Summary: Split Parquet/Avro configs for rebasing dates/timestamps in read and in write Key: SPARK-31318 URL: https://issues.apache.org/jira/browse/SPARK-31318 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark provides 2 SQL configs to control rebasing of dates/timestamps in the Parquet and Avro datasources: spark.sql.legacy.parquet.rebaseDateTime.enabled spark.sql.legacy.avro.rebaseDateTime.enabled The configs control rebasing both in read and in write. That can be inconvenient for users who want to read files saved by Spark 2.4 and earlier versions but save dates/timestamps without rebasing. The ticket aims to split the configs and introduce separate SQL configs for read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
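Illustrative usage after such a split; both config names below are assumptions about the outcome of the split (following the existing legacy-rebase naming pattern), not confirmed names:
{code:scala}
// Rebase on write, e.g. to stay readable by Spark 2.4...
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
// ...and, independently, rebase on read for files written by Spark 2.4
// and earlier.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", true)
{code}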
[jira] [Updated] (SPARK-31311) Benchmark date-time rebasing in ORC datasource
[ https://issues.apache.org/jira/browse/SPARK-31311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31311: --- Description: * Benchmark saving dates/timestamps before and after 1582-10-15 * Benchmark loading dates/timestamps was: * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on > Benchmark date-time rebasing in ORC datasource > -- > > Key: SPARK-31311 > URL: https://issues.apache.org/jira/browse/SPARK-31311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > * Benchmark saving dates/timestamps before and after 1582-10-15 > * Benchmark loading dates/timestamps -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31311) Benchmark date-time rebasing in ORC datasource
Maxim Gekk created SPARK-31311: -- Summary: Benchmark date-time rebasing in ORC datasource Key: SPARK-31311 URL: https://issues.apache.org/jira/browse/SPARK-31311 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070457#comment-17070457 ] Maxim Gekk commented on SPARK-31297: The rebasing of days doesn't depend on time zone, and has just 14 special dates: {code:scala} test("optimize rebasing") { val start = localDateToDays(LocalDate.of(1, 1, 1)) val end = localDateToDays(LocalDate.of(2030, 1, 1)) var days = start var diff = Long.MaxValue var counter = 0 while (days < end) { val rebased = rebaseGregorianToJulianDays(days) val curDiff = rebased - days if (curDiff != diff) { counter += 1 diff = curDiff val ld = daysToLocalDate(days) println(s"local date = $ld days = $days diff = ${diff} days") } days += 1 } println(s"counter = $counter") } {code} {code} local date = 0001-01-01 days = -719162 diff = -2 days local date = 0100-03-01 days = -682944 diff = -1 days local date = 0200-03-01 days = -646420 diff = 0 days local date = 0300-03-01 days = -609896 diff = 1 days local date = 0500-03-01 days = -536847 diff = 2 days local date = 0600-03-01 days = -500323 diff = 3 days local date = 0700-03-01 days = -463799 diff = 4 days local date = 0900-03-01 days = -390750 diff = 5 days local date = 1000-03-01 days = -354226 diff = 6 days local date = 1100-03-01 days = -317702 diff = 7 days local date = 1300-03-01 days = -244653 diff = 8 days local date = 1400-03-01 days = -208129 diff = 9 days local date = 1500-03-01 days = -171605 diff = 10 days local date = 1582-10-15 days = -141427 diff = 0 days counter = 14 {code} > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. 
> For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-t
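The 14 switch points printed in the comment above translate directly into a lookup table; a hedged sketch (arrays copied from that output, function name illustrative):
{code:scala}
// Days since 1970-01-01 at which the Gregorian->Julian day diff changes,
// with the matching diffs in days, both taken from the printed output.
val switchDays = Array(-719162, -682944, -646420, -609896, -536847, -500323,
  -463799, -390750, -354226, -317702, -244653, -208129, -171605, -141427)
val dayDiffs = Array(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0)

def rebaseDaysViaTable(days: Int): Int = {
  // A linear scan is fine for 14 points; days before the first switch
  // are assumed to use the first diff.
  var i = switchDays.length - 1
  while (i > 0 && days < switchDays(i)) i -= 1
  days + dayDiffs(i)
}
{code}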
[jira] [Commented] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070286#comment-17070286 ] Maxim Gekk commented on SPARK-31297: [~cloud_fan] [~hyukjin.kwon] [~dongjoon] WDYT? > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 
1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 1987-09-26T16:52:58 diff = 0 minutes > local date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes > local date-time = 1993-09-25T15:52:58 diff = 60 minutes > local date-time = 1993-09-25T16:52:
[jira] [Created] (SPARK-31297) Speed-up date-time rebasing
Maxim Gekk created SPARK-31297: -- Summary: Speed-up date-time rebasing Key: SPARK-31297 URL: https://issues.apache.org/jira/browse/SPARK-31297 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk I believe it is possible to speed up date-time rebasing by building a map from micros to the diffs between original and rebased micros, and looking the diffs up via binary search. For example, the *America/Los_Angeles* time zone has fewer than 100 points at which the diff changes: {code:scala} test("optimize rebasing") { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += MICROS_PER_HOUR } println(s"counter = $counter") } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time = 1986-09-27T16:52:58 diff = 0 minutes local date-time =
1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time = 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local date-time = 1997-10-25T15:52:58 diff = 60 minutes local date-time = 1997-10-25T16:52:58 diff =
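A minimal sketch of the lookup proposed above, assuming the switch points and diffs for one zone have been precomputed into two parallel sorted arrays (the names and shape here are illustrative, not the actual Spark implementation):
{code:scala}
import java.util.Arrays

// switches(i) is the first micros value at which diffs(i) applies;
// both arrays are sorted and have the same length.
def rebaseMicros(switches: Array[Long], diffs: Array[Long], micros: Long): Long = {
  var i = Arrays.binarySearch(switches, micros)
  // For a missing key, binarySearch returns -(insertionPoint + 1),
  // so the applicable diff is the one at the preceding switch point.
  if (i < 0) i = -i - 2
  if (i < 0) micros else micros + diffs(i)
}
{code}
With fewer than 100 switch points per time zone, the lookup costs O(log n) and avoids calendar arithmetic on every value.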
[jira] [Updated] (SPARK-31296) Benchmark date-time rebasing in Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-31296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31296: --- Summary: Benchmark date-time rebasing in Parquet datasource (was: Benchmark date-time rebasing to/from Julian calendar) > Benchmark date-time rebasing in Parquet datasource > -- > > Key: SPARK-31296 > URL: https://issues.apache.org/jira/browse/SPARK-31296 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > * Add benchmarks for saving dates/timestamps to parquet when > spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true > * Add a benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31296) Benchmark date-time rebasing to/from Julian calendar
Maxim Gekk created SPARK-31296: -- Summary: Benchmark date-time rebasing to/from Julian calendar Key: SPARK-31296 URL: https://issues.apache.org/jira/browse/SPARK-31296 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add a benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
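A rough sketch of the write path such a benchmark would measure in spark-shell; the path and the data shape are illustrative only:
{code:scala}
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
// Dates before 1582-10-15 exercise the rebasing code on write.
spark.range(10000000)
  .selectExpr("date_add(date'1001-01-01', cast(id % 365 as int)) AS d")
  .write.mode("overwrite").parquet("/tmp/rebase_date_bench")
{code}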
[jira] [Updated] (SPARK-31286) Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-31286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31286: --- Description: There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the `timeZone` option in JSON/CSV datasource, and the `tz` parameter of the from_utc_timestamp() and to_utc_timestamp() functions. was: There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the SQL config *spark.sql.session.timeZone* in the 2 forms mentioned above. > Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp > - > > Key: SPARK-31286 > URL: https://issues.apache.org/jira/browse/SPARK-31286 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > There are two distinct types of ID (see > https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): > # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the > same offset for all local date-times > # Geographical regions - an area where a specific set of rules for finding > the offset from UTC/Greenwich apply > For example, three-letter time zone IDs are ambiguous, and depend on the > locale. They have already been deprecated in the JDK, see > https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : > {code} > For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such > as "PST", "CTT", "AST") are also supported. However, their use is deprecated > because the same abbreviation is often used for multiple time zones (for > example, "CST" could be U.S. "Central Standard Time" and "China Standard > Time"), and the Java platform can then only recognize one of them.
> {code} > The ticket aims to specify formats of the `timeZone` option in JSON/CSV > datasource, and the `tz` parameter of the from_utc_timestamp() and > to_utc_timestamp() functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31286) Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
Maxim Gekk created SPARK-31286: -- Summary: Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp Key: SPARK-31286 URL: https://issues.apache.org/jira/browse/SPARK-31286 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the SQL config *spark.sql.session.timeZone* in the 2 forms mentioned above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
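The two ID types can be tried out with from_utc_timestamp in spark-shell; a region-based ID applies the DST rules of the area, while a fixed offset never changes:
{code:scala}
// Geographical region: the area's rules decide the offset.
spark.sql("SELECT from_utc_timestamp('2020-07-01 00:00:00', 'America/Los_Angeles')").show(false)
// Fixed offset: the same shift for all local date-times.
spark.sql("SELECT from_utc_timestamp('2020-07-01 00:00:00', '-08:00')").show(false)
{code}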
[jira] [Created] (SPARK-31284) Check rebasing of timestamps in ORC datasource
Maxim Gekk created SPARK-31284: -- Summary: Check rebasing of timestamps in ORC datasource Key: SPARK-31284 URL: https://issues.apache.org/jira/browse/SPARK-31284 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add tests to check that timestamps saved by Spark 2.4 are loaded back by Spark 3.0 correctly. Also add tests for timestamp rebasing in write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
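The shape of such a round-trip check in spark-shell; the path and value here are illustrative:
{code:scala}
val df = Seq(java.sql.Timestamp.valueOf("1001-01-01 01:02:03.123456")).toDF("ts")
df.write.mode("overwrite").orc("/tmp/orc_ts_rebase")
// A file written at the same path by Spark 2.4 should show the same value here.
spark.read.orc("/tmp/orc_ts_rebase").show(false)
{code}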
[jira] [Created] (SPARK-31277) Migrate `DateTimeTestUtils` from `TimeZone` to `ZoneId`
Maxim Gekk created SPARK-31277: -- Summary: Migrate `DateTimeTestUtils` from `TimeZone` to `ZoneId` Key: SPARK-31277 URL: https://issues.apache.org/jira/browse/SPARK-31277 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark SQL's date-time expressions and functions have been ported to the Java 8 time API, but tests still use the old time APIs. In particular, DateTimeTestUtils exposes functions that accept only TimeZone instances. This is inconvenient and CPU consuming because of the need to convert TimeZone instances to ZoneId instances via strings (zone ids). The ticket aims to replace the TimeZone parameters of DateTimeTestUtils functions with the ZoneId type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
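The conversion path in question, shown with the plain Java APIs:
{code:scala}
import java.time.ZoneId
import java.util.TimeZone

val tz = TimeZone.getTimeZone("America/Los_Angeles")
// The round trip via the string id that the ticket wants to avoid in tests:
val viaString: ZoneId = ZoneId.of(tz.getID)
// Constructing and passing ZoneId directly makes that conversion unnecessary:
val zid: ZoneId = ZoneId.of("America/Los_Angeles")
{code}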
[jira] [Created] (SPARK-31254) `HiveResult.toHiveString` does not use the current session time zone
Maxim Gekk created SPARK-31254: -- Summary: `HiveResult.toHiveString` does not use the current session time zone Key: SPARK-31254 URL: https://issues.apache.org/jira/browse/SPARK-31254 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, date/timestamp formatters in `HiveResult.toHiveString` are initialized once on instantiation of the `HiveResult` object, and pick up the session time zone at that point. If the session's time zone is changed, the formatters still use the previous one. See the discussion at https://github.com/apache/spark/pull/23391#discussion_r397347820 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
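A minimal sketch of the fix direction, assuming the formatter is derived from the current session time zone on every call instead of being cached at object initialization (names are illustrative, not the actual HiveResult code):
{code:scala}
import java.time.ZoneId
import java.time.format.DateTimeFormatter

// Re-reading the zone per call picks up changes to spark.sql.session.timeZone.
def timestampFormatter(sessionTimeZone: String): DateTimeFormatter =
  DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneId.of(sessionTimeZone))
{code}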
[jira] [Commented] (SPARK-31238) Incompatible ORC dates with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066427#comment-17066427 ] Maxim Gekk commented on SPARK-31238: I am working on the issue. > Incompatible ORC dates with Spark 2.4 > - > > Key: SPARK-31238 > URL: https://issues.apache.org/jira/browse/SPARK-31238 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Blocker > > Using Spark 2.4.5, write pre-1582 date to ORC file and then read it: > {noformat} > $ export TZ=UTC > $ bin/spark-shell --conf spark.sql.session.timeZone=UTC > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5-SNAPSHOT > /_/ > > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_161) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sql("select cast('1200-01-01' as date) > dt").write.mode("overwrite").orc("/tmp/datefile") > scala> spark.read.orc("/tmp/datefile").show > +--+ > |dt| > +--+ > |1200-01-01| > +--+ > scala> :quit > {noformat} > Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file: > {noformat} > $ export TZ=UTC > $ bin/spark-shell --conf spark.sql.session.timeZone=UTC > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_161) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.read.orc("/tmp/datefile").show > +--+ > |dt| > +--+ > |1200-01-08| > +--+ > scala> > {noformat} > Dates are off. > Timestamps, on the other hand, appear to work as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31237) Replace 3-letter time zones by zone offsets
Maxim Gekk created SPARK-31237: -- Summary: Replace 3-letter time zones by zone offsets Key: SPARK-31237 URL: https://issues.apache.org/jira/browse/SPARK-31237 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk 3-letter time zones are ambiguous, and have already been deprecated in the JDK, see [https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html] . Also, some short names are mapped to region-based zone IDs, and don't conform to actual definitions. For example, the PST short name is mapped to America/Los_Angeles. It has different zone offsets in the Java 7 and Java 8 APIs: {code:scala} scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-05 23:00:00").getTime)/3600000.0 res11: Double = -7.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 00:00:00").getTime)/3600000.0 res12: Double = -7.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 01:00:00").getTime)/3600000.0 res13: Double = -8.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 02:00:00").getTime)/3600000.0 res14: Double = -8.0 scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 03:00:00").getTime)/3600000.0 res15: Double = -8.0 {code} and in the Java 8 API https://github.com/apache/spark/pull/27980#discussion_r396287278 By definition, PST must be a constant equal to UTC-08:00, see https://www.timeanddate.com/time/zones/pst The ticket aims to replace all short time zone names with zone offsets in tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
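The ambiguity is easy to demonstrate with java.time: the short ID is silently resolved to a region with DST rules, while a fixed offset is constant by construction:
{code:scala}
import java.time.ZoneId

// The deprecated short ID resolves to a region, not a constant offset:
ZoneId.of("PST", ZoneId.SHORT_IDS) // America/Los_Angeles
// The unambiguous replacement proposed for tests:
ZoneId.of("-08:00")
{code}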
[jira] [Created] (SPARK-31232) Specify formats of `spark.sql.session.timeZone`
Maxim Gekk created SPARK-31232: -- Summary: Specify formats of `spark.sql.session.timeZone` Key: SPARK-31232 URL: https://issues.apache.org/jira/browse/SPARK-31232 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Maxim Gekk There are two distinct types of ID (see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html): # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the same offset for all local date-times # Geographical regions - an area where a specific set of rules for finding the offset from UTC/Greenwich apply For example, three-letter time zone IDs are ambiguous, and depend on the locale. They have already been deprecated in the JDK, see https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html : {code} For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such as "PST", "CTT", "AST") are also supported. However, their use is deprecated because the same abbreviation is often used for multiple time zones (for example, "CST" could be U.S. "Central Standard Time" and "China Standard Time"), and the Java platform can then only recognize one of them. {code} The ticket aims to specify formats of the SQL config *spark.sql.session.timeZone* in the 2 forms mentioned above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
[ https://issues.apache.org/jira/browse/SPARK-31212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064550#comment-17064550 ] Maxim Gekk commented on SPARK-31212: I think it would be better to use isLeapYear of GregorianCalendar, [https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html#isLeapYear(int)] There are other suspicious functions that need to be reviewed: [https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L608-L610] > Failure of casting the '1000-02-29' string to the date type > --- > > Key: SPARK-31212 > URL: https://issues.apache.org/jira/browse/SPARK-31212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > '1000-02-29' is a valid date in the Julian calendar, used in Spark 2.4.5 for > dates before 1582-10-15, but casting the string to the date type fails: > {code:scala} > scala> val df = > Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > ++ > |date| > ++ > |null| > ++ > {code} > Creating a dataset from java.sql.Date w/ the same input string works > correctly: > {code:scala} > scala> val df2 = > Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) > df2: org.apache.spark.sql.DataFrame = [date: date] > scala> df2.show > +--+ > | date| > +--+ > |1000-02-29| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
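The difference between the two leap-year rules can be checked directly: GregorianCalendar applies the Julian rule before the 1582 cutover, while java.time uses the proleptic Gregorian rule:
{code:scala}
import java.time.Year
import java.util.GregorianCalendar

new GregorianCalendar().isLeapYear(1000) // true: Julian rule, every 4th year is leap
Year.isLeap(1000)                        // false: proleptic Gregorian rule
{code}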
[jira] [Created] (SPARK-31221) Rebase all dates/timestamps in conversion in Java types
Maxim Gekk created SPARK-31221: -- Summary: Rebase all dates/timestamps in conversion in Java types Key: SPARK-31221 URL: https://issues.apache.org/jira/browse/SPARK-31221 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the fromJavaDate(), toJavaDate(), toJavaTimestamp() and fromJavaTimestamp() methods of DateTimeUtils perform rebasing only for dates before the Gregorian cutover date 1582-10-15, assuming that the Gregorian calendar behaves the same in the Java 7 and Java 8 APIs. The assumption is incorrect, in particular in getting zone offsets, for instance: {code:scala} scala> java.time.ZoneId.systemDefault res16: java.time.ZoneId = America/Los_Angeles scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0 warning: there was one deprecation warning; re-run with -deprecation for details res17: Double = 8.0 scala> java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00")) res18: java.time.ZoneOffset = -07:52:58 {code} The Java 7 API is not accurate: America/Los_Angeles changed its time zone offset from {code} -7:52:58 {code} to {code} -8:00 {code} The ticket aims to perform rebasing for any dates/timestamps independently of the calendar cutover date. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064376#comment-17064376 ] Maxim Gekk commented on SPARK-31183: [~koert] The problem will be resolved soon, see https://github.com/apache/spark/pull/27964#issuecomment-602152201 > Incompatible Avro dates/timestamps with Spark 2.4 > - > > Key: SPARK-31183 > URL: https://issues.apache.org/jira/browse/SPARK-31183 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Write dates/timestamps to Avro file in Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 > {code} > {code:scala} > scala> > df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |date | > +--+ > |1001-01-01| > +--+ > scala> > df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-01 01:02:03.123456| > +--+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from > Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 > {code} > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) > +--+ > |date | > +--+ > |1001-01-07| > +--+ > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-07 01:09:05.123456| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
[ https://issues.apache.org/jira/browse/SPARK-31212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064017#comment-17064017 ] Maxim Gekk commented on SPARK-31212: The isLeapYear() function in 2.4 assumes the Proleptic Gregorian calendar: https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L600-L602 but Spark 2.4 is actually based on the hybrid Julian+Gregorian calendar, as we can see at https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L513-L517 It means the following functions in DateTimeUtils return incorrect results for dates before the Gregorian cutover date: # getQuarter # splitDate # getMonth # getDayOfMonth # firstDayOfMonth # dateAddMonths # stringToTimestamp # stringToDate # monthsBetween # getLastDayOfMonth /cc [~cloud_fan] [~hyukjin.kwon] > Failure of casting the '1000-02-29' string to the date type > --- > > Key: SPARK-31212 > URL: https://issues.apache.org/jira/browse/SPARK-31212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > '1000-02-29' is a valid date in the Julian calendar, used in Spark 2.4.5 for > dates before 1582-10-15, but casting the string to the date type fails: > {code:scala} > scala> val df = > Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > ++ > |date| > ++ > |null| > ++ > {code} > Creating a dataset from java.sql.Date w/ the same input string works > correctly: > {code:scala} > scala> val df2 = > Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) > df2: org.apache.spark.sql.DataFrame = [date: date] > scala> df2.show > +--+ > | date| > +--+ > |1000-02-29| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
Maxim Gekk created SPARK-31212: -- Summary: Failure of casting the '1000-02-29' string to the date type Key: SPARK-31212 URL: https://issues.apache.org/jira/browse/SPARK-31212 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Maxim Gekk '1000-02-29' is a valid date in the Julian calendar, used in Spark 2.4.5 for dates before 1582-10-15, but casting the string to the date type fails: {code:scala} scala> val df = Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.show ++ |date| ++ |null| ++ {code} Creating a dataset from java.sql.Date w/ the same input string works correctly: {code:scala} scala> val df2 = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) df2: org.apache.spark.sql.DataFrame = [date: date] scala> df2.show +--+ | date| +--+ |1000-02-29| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31211) Failure on loading 1000-02-29 from parquet saved by Spark 2.4.5
Maxim Gekk created SPARK-31211: -- Summary: Failure on loading 1000-02-29 from parquet saved by Spark 2.4.5 Key: SPARK-31211 URL: https://issues.apache.org/jira/browse/SPARK-31211 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Save a valid date in the Julian calendar with Spark 2.4.5 in a leap year, for instance 1000-02-29: {code} $ export TZ="America/Los_Angeles" {code} {code:scala} scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.show +--+ | date| +--+ |1000-02-29| +--+ scala> df.write.mode("overwrite").format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro_leap") scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap") {code} Load the parquet files back with Spark 3.1.0-SNAPSHOT: {code:scala} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show +--+ | date| +--+ |1000-03-06| +--+ scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show 20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year at java.time.LocalDate.create(LocalDate.java:429) at java.time.LocalDate.of(LocalDate.java:269) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.rebaseJulianToGregorianDays(DateTimeUtils.scala:1008) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable
Maxim Gekk created SPARK-31195: -- Summary: Reuse days rebase functions of DateTimeUtils in DaysWritable Key: SPARK-31195 URL: https://issues.apache.org/jira/browse/SPARK-31195 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays() were added by the PR https://github.com/apache/spark/pull/27915. The ticket aims to replace similar code in org.apache.spark.sql.hive.DaysWritable with these functions in order to: # deduplicate code # reuse functions that are better tested and cross-checked by reading parquet files saved by Spark 2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061902#comment-17061902 ] Maxim Gekk commented on SPARK-31183: I am working on the issue. > Incompatible Avro dates/timestamps with Spark 2.4 > - > > Key: SPARK-31183 > URL: https://issues.apache.org/jira/browse/SPARK-31183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates/timestamps to Avro file in Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 > {code} > {code:scala} > scala> > df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |date | > +--+ > |1001-01-01| > +--+ > scala> > df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-01 01:02:03.123456| > +--+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from > Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 > {code} > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) > +--+ > |date | > +--+ > |1001-01-07| > +--+ > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-07 01:09:05.123456| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061903#comment-17061903 ] Maxim Gekk commented on SPARK-31183: [~cloud_fan] FYI > Incompatible Avro dates/timestamps with Spark 2.4 > - > > Key: SPARK-31183 > URL: https://issues.apache.org/jira/browse/SPARK-31183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates/timestamps to Avro file in Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 > {code} > {code:scala} > scala> > df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |date | > +--+ > |1001-01-01| > +--+ > scala> > df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-01 01:02:03.123456| > +--+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from > Spark 2.4.5: > {code} > $ export TZ="America/Los_Angeles" > $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 > {code} > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) > +--+ > |date | > +--+ > |1001-01-07| > +--+ > scala> > spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) > +--+ > |ts| > +--+ > |1001-01-07 01:09:05.123456| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
Maxim Gekk created SPARK-31183: -- Summary: Incompatible Avro dates/timestamps with Spark 2.4 Key: SPARK-31183 URL: https://issues.apache.org/jira/browse/SPARK-31183 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Write dates/timestamps to Avro file in Spark 2.4.5: {code} $ export TZ="America/Los_Angeles" $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 {code} {code:scala} scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) +--+ |date | +--+ |1001-01-01| +--+ scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) +--+ |ts| +--+ |1001-01-01 01:02:03.123456| +--+ {code} Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from Spark 2.4.5: {code} $ export TZ="America/Los_Angeles" $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 {code} {code:scala} scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) +--+ |date | +--+ |1001-01-07| +--+ scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false) +--+ |ts| +--+ |1001-01-07 01:09:05.123456| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059617#comment-17059617 ] Maxim Gekk commented on SPARK-31159: [~cloud_fan] FYI > Incompatible Parquet dates/timestamps with Spark 2.4 > > > Key: SPARK-31159 > URL: https://issues.apache.org/jira/browse/SPARK-31159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Write dates/timestamps to Parquet file in Spark 2.4: > {code} > $ export TZ="UTC" > $ ~/spark-2.4/bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_231) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.conf.set("spark.sql.session.timeZone", "UTC") > scala> val df = Seq(("1001-01-01", "1001-01-01 > 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), > $"tsS".cast("timestamp").as("ts")) > df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp] > scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") > scala> spark.conf.set("spark.sql.parquet.outputTimestampType", > "TIMESTAMP_MICROS") > scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") > scala> > spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) > +--+--+ > |d |ts| > +--+--+ > |1001-01-01|1001-01-01 01:02:03.123456| > +--+--+ > {code} > Spark 2.4 saves dates/timestamps in Julian calendar. The parquet-mr tool > prints *1001-01-07* and *1001-01-07T01:02:03.123456+*: > {code} > $ java -jar > /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar > dump -m > ./2_4_5_micros/part-0-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet > INT32 d > > *** row group 1 of 1, values 1 to 1 *** > value 1: R:0 D:1 V:1001-01-07 > INT64 ts > > *** row group 1 of 1, values 1 to 1 *** > value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+ > {code} > Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) prints the same as parquet-mr but > different values from Spark 2.4: > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-preview2 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_231) > scala> > spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) > +--+--+ > |d |ts| > +--+--+ > |1001-01-07|1001-01-07 01:02:03.123456| > +--+--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4
Maxim Gekk created SPARK-31159: -- Summary: Incompatible Parquet dates/timestamps with Spark 2.4 Key: SPARK-31159 URL: https://issues.apache.org/jira/browse/SPARK-31159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Write dates/timestamps to Parquet file in Spark 2.4: {code} $ export TZ="UTC" $ ~/spark-2.4/bin/spark-shell Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.5 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231) Type in expressions to have them evaluated. Type :help for more information. scala> spark.conf.set("spark.sql.session.timeZone", "UTC") scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp] scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) +--+--+ |d |ts| +--+--+ |1001-01-01|1001-01-01 01:02:03.123456| +--+--+ {code} Spark 2.4 saves dates/timestamps in Julian calendar. The parquet-mr tool prints *1001-01-07* and *1001-01-07T01:02:03.123456+*: {code} $ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-0-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet INT32 d *** row group 1 of 1, values 1 to 1 *** value 1: R:0 D:1 V:1001-01-07 INT64 ts *** row group 1 of 1, values 1 to 1 *** value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+ {code} Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) prints the same as parquet-mr but different values from Spark 2.4: {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-preview2 /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false) +--+--+ |d |ts| +--+--+ |1001-01-07|1001-01-07 01:02:03.123456| +--+--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057619#comment-17057619 ] Maxim Gekk commented on SPARK-30565: Per [~dongjoon], the default ORC reader doesn't fully cover the functionality of the Hive ORC reader, so some users may still have to use the Hive reader in some cases. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show a regression of ~3 times. > Before: > {code} > Hive built-in ORC 520531 >8 2.0 495.8 0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC 1761 1792 > 43 0.61679.3 0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
Maxim Gekk created SPARK-31076: -- Summary: Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time Key: SPARK-31076 URL: https://issues.apache.org/jira/browse/SPARK-31076 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk By default, collect() returns java.sql.Timestamp/Date instances with offsets derived from internal values of Catalyst's TIMESTAMP/DATE that store microseconds since the epoch. The conversion from internal values to java.sql.Timestamp/Date is based on the Proleptic Gregorian calendar, but converting the resulting values before the year 1582 to strings produces timestamp/date strings in the Julian calendar. For example: {code} scala> sql("select date '1100-10-10'").collect() res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) {code} This can be fixed if Catalyst's internal values are converted to a local date-time in the Gregorian calendar, and a local date-time is constructed from the resulting year, month, ..., seconds in the Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
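A sketch of the proposed conversion for timestamps (not the actual Spark code): interpret the internal micros in the Proleptic Gregorian calendar, then rebuild java.sql.Timestamp field by field, so the hybrid-calendar Timestamp renders the same local date-time string:
{code:scala}
import java.sql.Timestamp
import java.time.{Instant, LocalDateTime, ZoneId}

def microsToTimestamp(micros: Long, zone: ZoneId): Timestamp = {
  // Local date-time of the internal value in the Proleptic Gregorian calendar.
  val instant = Instant.ofEpochSecond(
    Math.floorDiv(micros, 1000000L), Math.floorMod(micros, 1000000L) * 1000)
  val ldt = LocalDateTime.ofInstant(instant, zone)
  // Timestamp.valueOf interprets the same fields in the hybrid Julian+Gregorian calendar.
  Timestamp.valueOf(ldt)
}
{code}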
[jira] [Created] (SPARK-31044) Support foldable input by `schema_of_json`
Maxim Gekk created SPARK-31044: -- Summary: Support foldable input by `schema_of_json` Key: SPARK-31044 URL: https://issues.apache.org/jira/browse/SPARK-31044 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the `schema_of_json()` function allows only a string literal as the input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
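What the change would allow, sketched in spark-shell; the second call fails before the change because its argument is foldable but not a literal:
{code:scala}
// Accepted today: a string literal.
spark.sql("""SELECT schema_of_json('{"id": 1, "city": "Moscow"}')""").show(false)
// Proposed: any foldable string expression.
spark.sql("""SELECT schema_of_json(concat('{"id": 1,', ' "city": "Moscow"}'))""").show(false)
{code}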
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051076#comment-17051076 ] Maxim Gekk commented on SPARK-30563: [~petertoth] If you think it is possible to avoid some overhead of the NoOp datasource, please open a PR. > Regressions in Join benchmarks > -- > > Key: SPARK-30563 > URL: https://issues.apache.org/jira/browse/SPARK-30563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Regenerated benchmark results in the > https://github.com/apache/spark/pull/27078 show many regressions in > JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see > old results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10 > new results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10 > One of the differences is the use of the `NoOp` datasource in the new > queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051073#comment-17051073 ] Maxim Gekk commented on SPARK-30563: > we spend a lot of time in this loop even The loop just forces materialization of the joined rows. With df.groupBy().count(), you skip some steps of the join, it seems. I think in most cases users need the results of the join, not just a count on top of it. > Regressions in Join benchmarks > -- > > Key: SPARK-30563 > URL: https://issues.apache.org/jira/browse/SPARK-30563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Regenerated benchmark results in the > https://github.com/apache/spark/pull/27078 show many regressions in > JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see > old results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10 > new results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10 > One of the differences is the use of the `NoOp` datasource in the new > queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
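For reference, the `NoOp` datasource used by the new benchmark queries forces full materialization of the result without writing anything, e.g.:
{code:scala}
val joined = spark.range(10000000).join(spark.range(10000000), "id")
// Materializes every joined row; no files are produced.
joined.write.format("noop").mode("overwrite").save()
{code}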
[jira] [Created] (SPARK-31025) Support foldable input by `schema_of_csv`
Maxim Gekk created SPARK-31025: -- Summary: Support foldable input by `schema_of_csv` Key: SPARK-31025 URL: https://issues.apache.org/jira/browse/SPARK-31025 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the `schema_of_csv()` function allows only a string literal as the input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31020) Support foldable schemas by `from_csv`
[ https://issues.apache.org/jira/browse/SPARK-31020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31020: --- Description: Currently, Spark accepts only literals or schema_of_csv w/ literal input as the schema parameter of from_csv. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There are no reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_csv(). was: Currently, Spark accepts only literals or schema_of_csv w/ literal input as the schema parameter of from_csv. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There is reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_csv(). > Support foldable schemas by `from_csv` > -- > > Key: SPARK-31020 > URL: https://issues.apache.org/jira/browse/SPARK-31020 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark accepts only literals or schema_of_csv w/ literal input as > the schema parameter of from_csv. And it fails on any foldable expressions, > for instance: > {code:sql} > spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city > STRING', 'dpt_org_', '')); > Error in query: Schema should be specified in DDL format as a string literal > or output of the schema_of_csv function instead of replace('dpt_org_id INT, > dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 > {code} > There are no reasons to restrict users by literals. The ticket aims to > support any foldable schemas by from_csv(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31023) Support foldable schemas by `from_json`
Maxim Gekk created SPARK-31023: -- Summary: Support foldable schemas by `from_json` Key: SPARK-31023 URL: https://issues.apache.org/jira/browse/SPARK-31023 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark accepts only literals or schema_of_json w/ literal input as the schema parameter of from_json. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There are no reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_json(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31020) Support foldable schemas by `from_csv`
Maxim Gekk created SPARK-31020: -- Summary: Support foldable schemas by `from_csv` Key: SPARK-31020 URL: https://issues.apache.org/jira/browse/SPARK-31020 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark accepts only literals or schema_of_csv w/ literal input as the schema parameter of from_csv. And it fails on any foldable expressions, for instance: {code:sql} spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 {code} There are no reasons to restrict users by literals. The ticket aims to support any foldable schemas by from_csv(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31005) Support time zone ids in casting strings to timestamps
Maxim Gekk created SPARK-31005: -- Summary: Support time zone ids in casting strings to timestamps Key: SPARK-31005 URL: https://issues.apache.org/jira/browse/SPARK-31005 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark supports only time zone offsets in the formats: * -[h]h:[m]m * +[h]h:[m]m * Z The ticket aims to support any valid time zone ids at the end of timestamp strings, for instance: {code} 2015-03-18T12:03:17.123456 Europe/Moscow {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
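The proposed behavior, sketched in spark-shell; before the change the second cast returns null because only offsets and Z are recognized:
{code:scala}
// Supported today: a zone offset at the end of the string.
spark.sql("SELECT cast('2015-03-18T12:03:17.123456+03:00' AS timestamp)").show(false)
// Proposed: a region-based time zone id at the end of the string.
spark.sql("SELECT cast('2015-03-18T12:03:17.123456 Europe/Moscow' AS timestamp)").show(false)
{code}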
[jira] [Created] (SPARK-30988) Add more edge-case exercising values to stats tests
Maxim Gekk created SPARK-30988: -- Summary: Add more edge-case exercising values to stats tests Key: SPARK-30988 URL: https://issues.apache.org/jira/browse/SPARK-30988 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add more edge-cases to StatisticsCollectionTestBase -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30925) Overflow/round errors in conversions of milliseconds to/from microseconds
Maxim Gekk created SPARK-30925: -- Summary: Overflow/round errors in conversions of milliseconds to/from microseconds Key: SPARK-30925 URL: https://issues.apache.org/jira/browse/SPARK-30925 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Spark has special methods in DateTimeUtils for converting microseconds from/to milliseconds - `fromMillis()` and `toMillis()`. The methods handle arithmetic overflow and correctly round negative values. The ticket aims to review all places in Spark SQL where microseconds are converted from/to milliseconds, and replace them with the util methods from DateTimeUtils. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
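To illustrate the rounding pitfall the util methods avoid (a standalone sketch, not Spark source):
{code:scala}
// Naive integer division truncates toward zero, so negative microsecond
// values land in the wrong millisecond bucket.
val micros = -1001L                        // 1.001 ms before the epoch
val naive  = micros / 1000L                // -1: truncated toward zero
val floor  = Math.floorDiv(micros, 1000L)  // -2: floored, the correct bucket
// In the other direction, Math.multiplyExact(millis, 1000L) surfaces
// arithmetic overflow instead of silently wrapping around.
{code}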
[jira] [Commented] (SPARK-30894) The behavior of Size function should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041284#comment-17041284 ] Maxim Gekk commented on SPARK-30894: I am working on it. > The behavior of Size function should not depend on SQLConf.get > -- > > Key: SPARK-30894 > URL: https://issues.apache.org/jira/browse/SPARK-30894 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30892) Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
[ https://issues.apache.org/jira/browse/SPARK-30892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30892: --- Description: The spark.sql.variable.substitute.depth SQL config has not been used since Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for configs that could impact behavior, but not for this particular config. Raising such an exception will just make migration to Spark 3.0 more difficult. (was: The spark.sql.variable.substitute.depth SQL config has not been used since Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for configs that could impact behavior, but not for this particular config. Raising such an exception will just make migration to Spark more difficult.) > Exclude spark.sql.variable.substitute.depth from removedSQLConfigs > -- > > Key: SPARK-30892 > URL: https://issues.apache.org/jira/browse/SPARK-30892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > The spark.sql.variable.substitute.depth SQL config has not been used since > Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the > config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user > sets it to a non-default value (1, for example), they will get an exception. > That is acceptable for configs that could impact behavior, but not for this > particular config. Raising such an exception will just make migration to > Spark 3.0 more difficult. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30892) Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
Maxim Gekk created SPARK-30892: -- Summary: Exclude spark.sql.variable.substitute.depth from removedSQLConfigs Key: SPARK-30892 URL: https://issues.apache.org/jira/browse/SPARK-30892 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The spark.sql.variable.substitute.depth SQL config has not been used since Spark 2.4 inclusively. By [https://github.com/apache/spark/pull/27169], the config was placed in SQLConf.removedSQLConfigs. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for configs that could impact behavior, but not for this particular config. Raising such an exception will just make migration to Spark more difficult. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
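A sketch of the failure mode described above (the exact exception message is hypothetical):
{code:scala}
// Once a key is listed in SQLConf.removedSQLConfigs, setting it to anything
// other than the recorded default fails instead of being silently ignored.
spark.conf.set("spark.sql.variable.substitute.depth", 1)
// org.apache.spark.sql.AnalysisException: The SQL config
// 'spark.sql.variable.substitute.depth' was removed ... (message approximate)
{code}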
[jira] [Comment Edited] (SPARK-30858) IntegralDivide's dataType should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039433#comment-17039433 ] Maxim Gekk edited comment on SPARK-30858 at 2/18/20 8:29 PM: - The *div* function binds on this particular expression [https://github.com/apache/spark/blob/919d551ddbf7575abe7fe47d4bbba62164d6d845/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L282] . I am not sure that we can replace it during analysis. was (Author: maxgekk): The *div* function binds on this particular expressions [https://github.com/apache/spark/blob/919d551ddbf7575abe7fe47d4bbba62164d6d845/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L282] . I am not sure that we can replace it during analysis. > IntegralDivide's dataType should not depend on SQLConf.get > -- > > Key: SPARK-30858 > URL: https://issues.apache.org/jira/browse/SPARK-30858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Blocker > > {{IntegralDivide}}'s dataType depends on the value of > {{SQLConf.get.integralDivideReturnLong}}. This is a problem because the > configuration can change between different phases of planning, and this can > silently break a query plan which can lead to crashes or data corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30858) IntegralDivide's dataType should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039433#comment-17039433 ] Maxim Gekk commented on SPARK-30858: The *div* function binds on this particular expressions [https://github.com/apache/spark/blob/919d551ddbf7575abe7fe47d4bbba62164d6d845/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L282] . I am not sure that we can replace it during analysis. > IntegralDivide's dataType should not depend on SQLConf.get > -- > > Key: SPARK-30858 > URL: https://issues.apache.org/jira/browse/SPARK-30858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Blocker > > {{IntegralDivide}}'s dataType depends on the value of > {{SQLConf.get.integralDivideReturnLong}}. This is a problem because the > configuration can change between different phases of planning, and this can > silently break a query plan which can lead to crashes or data corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30858) IntegralDivide's dataType should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039410#comment-17039410 ] Maxim Gekk commented on SPARK-30858: > This is a problem because the configuration can change between different > phases of planning [~hvanhovell] Is the code below the right solution to the problem?
{code:scala}
case class IntegralDivide(
    left: Expression,
    right: Expression,
    integralDivideReturnLong: Boolean) extends DivModLike {

  // Capture the config value once, at construction time, so the data type
  // cannot change between planning phases.
  def this(left: Expression, right: Expression) = {
    this(left, right, SQLConf.get.integralDivideReturnLong)
  }

  override def dataType: DataType = if (integralDivideReturnLong) {
    LongType
  } else {
    left.dataType
  }
}
{code}
> IntegralDivide's dataType should not depend on SQLConf.get > -- > > Key: SPARK-30858 > URL: https://issues.apache.org/jira/browse/SPARK-30858 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Blocker > > {{IntegralDivide}}'s dataType depends on the value of > {{SQLConf.get.integralDivideReturnLong}}. This is a problem because the > configuration can change between different phases of planning, and this can > silently break a query plan which can lead to crashes or data corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30869) Convert dates to/from timestamps in microseconds precision
Maxim Gekk created SPARK-30869: -- Summary: Convert dates to/from timestamps in microseconds precision Key: SPARK-30869 URL: https://issues.apache.org/jira/browse/SPARK-30869 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark converts dates to/from "timestamp" in millisecond precision, but internally Catalyst's TimestampType values are stored as microseconds since the epoch. When such a conversion is needed in other date-timestamp functions like DateTimeUtils.monthsBetween, the function has to convert microseconds to milliseconds and then to days, see https://github.com/apache/spark/blob/06217cfded8d32962e7c54c315f8e684eb9f0999/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L577-L580 which just brings additional overhead without any benefit. In earlier versions, this made sense because milliseconds could be passed to TimeZone.getOffset, but Spark has since switched to the Java 8 time API and ZoneId, so conversions to milliseconds are no longer needed. The ticket aims to replace millisToDays with microsToDays, and daysToMillis with daysToMicros in DateTimeUtils. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
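A minimal sketch of the proposed direct micros-to-days conversion (assumed shape; the actual DateTimeUtils helper may differ):
{code:scala}
import java.time.{Instant, ZoneId}

// Convert microseconds since the epoch straight to days since the epoch in
// the given time zone, skipping the intermediate millisecond step.
def microsToDays(micros: Long, zoneId: ZoneId): Int = {
  val instant = Instant.ofEpochSecond(
    Math.floorDiv(micros, 1000000L),          // whole seconds, floored
    Math.floorMod(micros, 1000000L) * 1000L)  // remainder as nanoseconds
  instant.atZone(zoneId).toLocalDate.toEpochDay.toInt
}
{code}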
[jira] [Created] (SPARK-30865) Refactor DateTimeUtils
Maxim Gekk created SPARK-30865: -- Summary: Refactor DateTimeUtils Key: SPARK-30865 URL: https://issues.apache.org/jira/browse/SPARK-30865 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk
* Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
* Remove TimeZoneGMT because it is equal to UTC
* Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
* Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType, as sketched below

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
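A sketch of the aliasing step (placement in DateTimeUtils assumed; the values reflect Catalyst's internal representations):
{code:scala}
// DateType is stored as days since 1970-01-01, TimestampType as
// microseconds since 1970-01-01T00:00:00Z.
type SQLDate = Int
type SQLTimestamp = Long
{code}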
[jira] [Updated] (SPARK-30857) Wrong truncations of timestamps before the epoch to hours and days
[ https://issues.apache.org/jira/browse/SPARK-30857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30857: --- Description: Truncations to hours of timestamps after the epoch are correct:
{code:sql}
spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), date_trunc('HOUR', '2020-02-11 00:01:02.789');
2020-02-11 00:00:00 2020-02-11 00:00:00
{code}
but truncations of timestamps before the epoch are incorrect:
{code:sql}
spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), date_trunc('HOUR', '1960-02-11 00:01:02.789');
1960-02-11 01:00:00 1960-02-11 01:00:00
{code}
The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00*
The same holds at the DAY level:
{code:sql}
spark-sql> select date_trunc('DAY', '1960-02-11 00:01:02.123'), date_trunc('DAY', '1960-02-11 00:01:02.789');
1960-02-12 00:00:00 1960-02-12 00:00:00
{code}
The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00*
was: Truncations to hours of timestamps after the epoch are correct: {code:sql} spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), date_trunc('HOUR', '2020-02-11 00:01:02.789'); 2020-02-11 00:00:00 2020-02-11 00:00:00 {code} but truncations of timestamps before the epoch are incorrect: {code:sql} spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), date_trunc('HOUR', '1960-02-11 00:01:02.789'); 1960-02-11 01:00:00 1960-02-11 01:00:00 {code} The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00* The same holds at the DAY level: {code:sql} spark-sql> select date_trunc('DAY', '1960-02-11 00:01:02.123'), date_trunc('DAY', '1960-02-11 00:01:02.789'); 1960-02-12 00:00:00 1960-02-12 00:00:00 {code} The result must be 1960-02-11 00:00:00 1960-02-11 00:00:00 > Wrong truncations of timestamps before the epoch to hours and days > -- > > Key: SPARK-30857 > URL: https://issues.apache.org/jira/browse/SPARK-30857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > Truncations to hours of timestamps after the epoch are correct: > {code:sql} > spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), > date_trunc('HOUR', '2020-02-11 00:01:02.789'); > 2020-02-11 00:00:00 2020-02-11 00:00:00 > {code} > but truncations of timestamps before the epoch are incorrect: > {code:sql} > spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), > date_trunc('HOUR', '1960-02-11 00:01:02.789'); > 1960-02-11 01:00:00 1960-02-11 01:00:00 > {code} > The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00* > The same holds at the DAY level: > {code:sql} > spark-sql> select date_trunc('DAY', '1960-02-11 00:01:02.123'), > date_trunc('DAY', '1960-02-11 00:01:02.789'); > 1960-02-12 00:00:00 1960-02-12 00:00:00 > {code} > The result must be *1960-02-11 00:00:00 1960-02-11 00:00:00* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org