Sanjar Akhmedov created HIVE-22477:
--------------------------------------
Summary: Avro logical type timestamp conversion is slow
Key: HIVE-22477
URL: https://issues.apache.org/jira/browse/HIVE-22477
Project: Hive
Issue Type: Improvement
Affects Versions: 3.1.0
Environment: Hive 3.1.0
Reporter: Sanjar Akhmedov
We have an Avro-backed table with hundreds of billions of timestamps. A simple
{{SELECT COUNT(*) FROM t}} query takes many hours to complete in version 3.1.0,
versus tens of minutes in version 1.2.1.
Looking at the attached flamegraph of one of the YARN containers, Hive is
spending most of its time throwing exceptions during Avro timestamp conversion.
It is generally a good idea to avoid throwing exceptions in performance-critical
sections: creating an exception is an expensive operation, and repeating it for
many rows/values in a query can have drastic performance implications.
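To illustrate the point, here is a minimal sketch of a lenient parser that uses an exception as control flow, next to a direct numeric conversion. The class and method names ({{ExceptionCostSketch}}, {{lenientParse}}, {{direct}}) are hypothetical and use only {{java.time}}; this is not Hive's code, just the general pattern.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration only; not taken from Hive.
final class ExceptionCostSketch {

    // Lenient style: try the date-time format first, fall back to date-only.
    // For a date-only input the first attempt always throws, so every call
    // pays for creating an exception (including Throwable.fillInStackTrace).
    static LocalDateTime lenientParse(String text) {
        try {
            return LocalDateTime.parse(text, DateTimeFormatter.ISO_LOCAL_DATE_TIME);
        } catch (DateTimeParseException e) {
            return LocalDate.parse(text, DateTimeFormatter.ISO_LOCAL_DATE).atStartOfDay();
        }
    }

    // Direct style: the value is already numeric (days since 1970-01-01),
    // so no string formatting, parsing, or exception is involved.
    static LocalDateTime direct(long epochDay) {
        return LocalDate.ofEpochDay(epochDay).atStartOfDay();
    }
}
```

Both methods produce the same {{LocalDateTime}} for the same date, but the lenient path throws and catches one exception per date-only value, which matches the kind of hot spot the flamegraph points at.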
As far as I can see, there is no reason to convert a numeric timestamp to a
string and go through the very lenient
{{org.apache.hadoop.hive.common.type.TimestampTZUtil#parse(java.lang.String,
java.time.ZoneId)}} to do the timezone conversion.
This patch changes the conversion of {{Date}} and {{Timestamp}} to
{{TimestampTZ}} such that it doesn't invoke {{parse}}.
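For readers without the patch at hand, the idea can be sketched with plain {{java.time}} as below. This is an assumption about the general shape of such a conversion, not the patch itself: the class and method names are hypothetical, and the zone-less inputs are modelled as raw epoch values rather than Hive's {{Date}} and {{Timestamp}} types.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical sketch; Hive's actual TimestampTZUtil.convert may differ.
final class DirectTimestampConversionSketch {

    private static final long SECONDS_PER_DAY = 86_400L;

    // A zone-less date given as days since 1970-01-01: read the wall-clock
    // fields in UTC, then attach the target zone without shifting them.
    static ZonedDateTime fromEpochDay(long epochDay, ZoneId zone) {
        return fromLocalEpoch(epochDay * SECONDS_PER_DAY, 0, zone);
    }

    // A zone-less timestamp given as epoch seconds plus nanoseconds.
    // No string is built and no parser is entered, so nothing is thrown
    // on the normal path.
    static ZonedDateTime fromLocalEpoch(long epochSecond, int nanos, ZoneId zone) {
        return ZonedDateTime
            .ofInstant(Instant.ofEpochSecond(epochSecond, nanos), ZoneOffset.UTC)
            .withZoneSameLocal(zone);
    }
}
```

The design choice here is to keep the local date and time fields fixed and only change the zone they are interpreted in, which is what a string round trip through a lenient parser would also achieve, only far more slowly.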
JMH timings before:
{code:java}
Benchmark                              Mode  Cnt      Score   Error  Units
TimestampTZUtilBench.convertDate       avgt    2  10091.990          ns/op
TimestampTZUtilBench.convertTimestamp  avgt    2  10657.596          ns/op
{code}
JMH timings after:
{code:java}
Benchmark                              Mode  Cnt   Score   Error  Units
TimestampTZUtilBench.convertDate       avgt    2  48.371          ns/op
TimestampTZUtilBench.convertTimestamp  avgt    2  51.170          ns/op
{code}
JMH stack profile before:
{code:java}
Secondary result
"org.apache.hive.benchmark.common.TimestampTZUtilBench.convertDate:·stack":
Stack profiler:
....[Thread state distributions]....................................................................
100.0% RUNNABLE
....[Thread state: RUNNABLE]........................................................................
97.4% 97.4% java.lang.Throwable.fillInStackTrace
1.6% 1.6% java.time.format.DateTimeFormatter.parse
0.2% 0.2% java.time.ZoneId.from
0.1% 0.1% java.util.HashMap.hash
0.1% 0.1% java.lang.Number.<init>
0.1% 0.1% java.time.format.DateTimeFormatterBuilder$CompositePrinterParser.format
0.1% 0.1% java.lang.StringBuilder.append
0.1% 0.1% java.util.HashMap.putVal
0.1% 0.1% java.lang.String.valueOf
0.1% 0.1% java.util.regex.Pattern$BmpCharProperty.match
0.2% 0.2% <other>
...
Secondary result
"org.apache.hive.benchmark.common.TimestampTZUtilBench.convertTimestamp:·stack":
Stack profiler:
....[Thread state distributions]....................................................................
100.0% RUNNABLE
....[Thread state: RUNNABLE]........................................................................
96.5% 96.5% java.lang.Throwable.fillInStackTrace
1.0% 1.0% java.time.format.DateTimeFormatter.parse
0.6% 0.6% org.apache.hadoop.hive.common.type.TimestampTZUtil.parse
0.4% 0.4% java.time.ZoneId.from
0.2% 0.2% java.time.format.DateTimeFormatterBuilder$CompositePrinterParser.format
0.2% 0.2% java.time.format.Parsed.resolveFields
0.2% 0.2% java.lang.String.valueOf
0.1% 0.1% java.lang.StringBuilder.append
0.1% 0.1% java.util.HashMap.hash
0.1% 0.1% java.time.format.DateTimeParseContext.toResolved
0.6% 0.6% <other>
{code}
JMH stack profile after:
{code:java}
Secondary result
"org.apache.hive.benchmark.common.TimestampTZUtilBench.convertDate:·stack":
Stack profiler:
....[Thread state distributions]....................................................................
100.0% RUNNABLE
....[Thread state: RUNNABLE]........................................................................
91.6% 91.6% java.time.ZonedDateTime.ofInstant
8.0% 8.0% org.apache.hive.benchmark.common.generated.TimestampTZUtilBench_convertDate_jmhTest.convertDate_avgt_jmhStub
0.1% 0.1% java.time.zone.ZoneRules.<init>
0.1% 0.1% java.time.LocalDateTime.ofEpochSecond
0.1% 0.1% org.apache.hadoop.hive.common.type.TimestampTZUtil.convert
0.1% 0.1% java.time.LocalDate.ofEpochDay
0.1% 0.1% java.time.ZonedDateTime.create
...
Secondary result
"org.apache.hive.benchmark.common.TimestampTZUtilBench.convertTimestamp:·stack":
Stack profiler:
....[Thread state distributions]....................................................................
100.0% RUNNABLE
....[Thread state: RUNNABLE]........................................................................
90.7% 90.7% java.time.ZonedDateTime.ofInstant
9.0% 9.0% org.apache.hive.benchmark.common.generated.TimestampTZUtilBench_convertTimestamp_jmhTest.convertTimestamp_avgt_jmhStub
0.1% 0.1% java.time.zone.ZoneRules.<init>
0.1% 0.1% org.apache.hive.benchmark.common.generated.TimestampTZUtilBench_convertTimestamp_jmhTest.convertTimestamp_AverageTime
0.1% 0.1% java.time.LocalDateTime.ofEpochSecond
0.1% 0.1% java.time.LocalDate.ofEpochDay
0.1% 0.1% java.time.ZonedDateTime.create
{code}