[jira] (HIVE-15743) vectorized text parsing: speed up double parse

Gopal V (JIRA) Tue, 31 Jan 2017 12:16:03 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847429#comment-15847429
 ]


Gopal V commented on HIVE-15743:
--------------------------------


> String creation is actually more than half of the cost of going from byte[] 
> to double (see picture).

Specifically, the cost is a TLAB allocation miss. The code that is there is 
already ~10x faster, but we can do better by giving up String and always 
assuming UTF-8 bytes.

Also, a MutableDouble::parse() would let the system return a (Success, Value) 
tuple, which should allow the fallback re-execution pathway to kick in for 
any failures.
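
A minimal sketch of what such a call could look like, assuming a hypothetical 
MutableDouble holder (the class, its field and the 18-digit cap are 
illustrative, not existing Hive code). The boolean return is the Success half 
of the tuple and the field is the Value half, so nothing is allocated per 
field:

```java
/** Hypothetical holder for illustration; not an existing Hive class. */
final class MutableDouble {
    // Powers of ten up to 1e18 are exactly representable as doubles.
    private static final double[] POW10 = new double[19];
    static {
        POW10[0] = 1.0;
        for (int i = 1; i < POW10.length; i++) {
            POW10[i] = POW10[i - 1] * 10.0;
        }
    }

    double value;

    /**
     * Parses [+-]digits[.digits] directly from ASCII/UTF-8 bytes without
     * creating a String. Returns false and leaves value untouched for
     * anything else (exponents, NaN, blanks, more than 18 digits), so the
     * caller can re-run the original parser.
     */
    boolean parse(byte[] bytes, int start, int length) {
        if (length <= 0) {
            return false;
        }
        int i = start;
        int end = start + length;
        boolean negative = false;
        if (bytes[i] == '-' || bytes[i] == '+') {
            negative = bytes[i] == '-';
            i++;
        }
        long mantissa = 0;
        int digits = 0;
        int fractionDigits = 0;
        boolean seenDot = false;
        for (; i < end; i++) {
            byte b = bytes[i];
            if (b == '.' && !seenDot) {
                seenDot = true;
                continue;
            }
            if (b < '0' || b > '9') {
                return false; // not a plain decimal: let the caller fall back
            }
            if (++digits > 18) {
                return false; // mantissa could overflow a long
            }
            mantissa = mantissa * 10 + (b - '0');
            if (seenDot) {
                fractionDigits++;
            }
        }
        if (digits == 0) {
            return false;
        }
        double v = mantissa / POW10[fractionDigits];
        value = negative ? -v : v;
        return true;
    }
}
```

A caller would check the return value and only re-execute the String-based 
path when it is false.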

> if it's rare and easy to detect, is to handle the 99% cases fast, and fall 
> back to Double.parse(new String) in exotic/rare cases.

I went through all the data samples I have from various cases. The largest 
number of digits in the raw data was 18 digits (15,2), with the most common 
Decimal source pattern hovering around (9,2).

None of them would be hit by a 2 ULP error, but we can always fall back to 
the original parser for inputs with more than 18 digits.
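
As a rough sketch of that gate (the class name, method names and the 
MAX_FAST_DIGITS constant are assumptions for illustration, not the patch 
itself): plain decimals with at most 18 digits take the byte-based path, and 
everything else is handed to the original Double.parseDouble(new String(...)) 
call.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only; names and structure are assumptions, not Hive code. */
final class DoubleParseWithFallback {
    static final int MAX_FAST_DIGITS = 18; // 18 decimal digits always fit in a long

    private static final double[] POW10 = new double[MAX_FAST_DIGITS + 1];
    static {
        POW10[0] = 1.0;
        for (int i = 1; i < POW10.length; i++) {
            POW10[i] = POW10[i - 1] * 10.0; // exact for these small powers
        }
    }

    static double parse(byte[] bytes, int start, int length) {
        int i = start;
        int end = start + length;
        boolean negative = false;
        if (i < end && (bytes[i] == '-' || bytes[i] == '+')) {
            negative = bytes[i] == '-';
            i++;
        }
        long mantissa = 0;
        int digits = 0;
        int scale = 0;
        boolean seenDot = false;
        for (; i < end; i++) {
            byte b = bytes[i];
            if (b == '.' && !seenDot) {
                seenDot = true;
            } else if (b >= '0' && b <= '9' && digits < MAX_FAST_DIGITS) {
                mantissa = mantissa * 10 + (b - '0');
                digits++;
                if (seenDot) {
                    scale++;
                }
            } else {
                // Exponent, NaN/Infinity, whitespace, or a 19th digit.
                return slowParse(bytes, start, length);
            }
        }
        if (digits == 0) {
            return slowParse(bytes, start, length);
        }
        double v = mantissa / POW10[scale];
        return negative ? -v : v;
    }

    /** The original String-based path; throws NumberFormatException on bad input. */
    static double slowParse(byte[] bytes, int start, int length) {
        return Double.parseDouble(
            new String(bytes, start, length, StandardCharsets.UTF_8));
    }
}
```

For mantissas below 2^53 the single division is correctly rounded, so the 
fast path should agree with Double.parseDouble; between that and 18 digits an 
extra rounding step can creep in, which is where a small ULP error could come 
from.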

> vectorized text parsing: speed up double parse
> ----------------------------------------------
>
>                 Key: HIVE-15743
>                 URL: https://issues.apache.org/jira/browse/HIVE-15743
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Teddy Choi
>         Attachments: HIVE-15743.1.patch, HIVE-15743.2.patch, tpch-without.png
>
>
> {noformat}
> Double.parseDouble(
>                 new String(bytes, fieldStart, fieldLength, 
> StandardCharsets.UTF_8));{noformat}
> This takes ~25% of the query time in some cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
