[
https://issues.apache.org/jira/browse/HIVE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664572#comment-13664572
]
Eric Hanson commented on HIVE-4548:
-----------------------------------
It appears that all the specific characters you are checking for in
parseSimplePattern (%, _, \) cannot be the first or last character of a
surrogate pair. So I think the code is safe. Please think this through and add
some unit tests that process multi-byte UTF-8 characters of 4 bytes, i.e. code
points above U+FFFF (which will force encoding as surrogate pairs inside a
String).
See
http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000_to_U.2B10FFFF
for a discussion of surrogate pairs.
See http://en.wikipedia.org/wiki/List_of_Unicode_characters for a list of
Unicode characters. % is 0x0025, _ is 0x005F, and \ is 0x005C. All surrogate
pairs have lead surrogates in the range 0xD800..0xDBFF and trail surrogates
in the range 0xDC00..0xDFFF.
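As a rough illustration of the kind of check such a unit test could make, here
is a minimal Java sketch. The class and method names (SurrogatePairCheck,
isLikeMetaChar, anySurrogateIsMetaChar) are hypothetical and are not taken from
the patch; the sketch only assumes that the pattern is scanned one char at a
time. It uses U+1F600, a code point above U+FFFF, which takes 4 bytes in UTF-8
and two chars in a Java String.

```java
import java.nio.charset.StandardCharsets;

final class SurrogatePairCheck {
    // Hypothetical helper: true if c is one of the LIKE metacharacters %, _ or \.
    static boolean isLikeMetaChar(char c) {
        return c == '%' || c == '_' || c == '\\';
    }

    // True if any surrogate code unit in s would be mistaken for a metacharacter.
    // Since surrogates lie in 0xD800..0xDFFF, this should never happen.
    static boolean anySurrogateIsMetaChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean surrogate = Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
            if (surrogate && isLikeMetaChar(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1F600 is outside the Basic Multilingual Plane, so Java stores it
        // as a lead surrogate followed by a trail surrogate.
        String supplementary = new String(Character.toChars(0x1F600));
        String pattern = "ab" + supplementary + "%";

        System.out.println("UTF-16 code units: " + supplementary.length());
        System.out.println("UTF-8 bytes: "
            + supplementary.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("lead surrogate: "
            + Character.isHighSurrogate(supplementary.charAt(0)));
        System.out.println("trail surrogate: "
            + Character.isLowSurrogate(supplementary.charAt(1)));
        System.out.println("surrogate confused with metachar: "
            + anySurrogateIsMetaChar(pattern));
    }
}
```

A real test would feed patterns like this through parseSimplePattern itself and
compare the result against the general LIKE path.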
> Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%
> ----------------------------------------------------------------------
>
> Key: HIVE-4548
> URL: https://issues.apache.org/jira/browse/HIVE-4548
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: vectorization-branch
> Reporter: Eric Hanson
> Assignee: Teddy Choi
> Priority: Minor
> Fix For: vectorization-branch
>
> Attachments: HIVE-4548.1-with-benchmark.patch.txt,
> HIVE-4548.1-without-benchmark.patch.txt,
> HIVE-4548.2-with-benchmark.patch.txt, HIVE-4548.2-without-benchmark.patch.txt
>
>
> Speed up vectorized LIKE filter evaluation for abc%, %abc, and %abc% pattern
> special cases (here, abc is just a placeholder for some fixed string).
>
> Problem: The current vectorized LIKE implementation always calls the standard
> LIKE function code in UDFLike.java. But this is pretty expensive. It calls
> multiple functions and allocates at least one new object per call. Probably
> 80% of uses of LIKE are for the simple patterns abc%, %abc, and %abc%. These
> can be implemented much more efficiently.
> Start by speeding up the case for
> Column LIKE "abc%"
>
> The goal would be to minimize expense in the inner loop. Don't use new() in
> the inner loop, and write a static function that checks the prefix of the
> string matches the like pattern as efficiently as possible, operating
> directly on the byte array holding UTF-8-encoded string data, and avoiding
> unnecessary additional function calls and if/else logic. Call that in the
> inner loop.
> If feasible, consider using a template-driven approach, with an instance of
> the template expanded for each of the three cases. Start doing the abc%
> (prefix match) by hand, then consider templatizing for the other two cases.
> The code is in the "vectorization" branch of the main hive repo.
>
> Start by checking in the constructor for FilterStringColLikeStringScalar.java
> if the pattern is one of the simple special cases. If so, record that, and
> have the evaluate() method call a special-case function for each case, i.e.
> the general case, and each of the 3 special cases. All the dynamic
> decision-making would be done once per vector, not once per element.
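The approach described in the issue can be sketched roughly as follows for the
abc% case. This is only an illustration under stated assumptions, not the code
in the attached patches: LikePrefixSketch, PatternKind, classify, startsWith
and filterPrefix are hypothetical names, and the column is modeled as plain
arrays (one byte[] per row plus start and length arrays) rather than Hive's
actual vector classes. The pattern is classified once, and the inner loop
compares raw UTF-8 bytes without allocating.

```java
import java.nio.charset.StandardCharsets;

final class LikePrefixSketch {
    enum PatternKind { PREFIX, GENERAL }

    // Classify the pattern once (e.g. in a constructor). Only "abc%" with no
    // other wildcard or escape character counts as a simple prefix pattern.
    static PatternKind classify(String pattern) {
        int n = pattern.length();
        if (n < 2 || pattern.charAt(n - 1) != '%') {
            return PatternKind.GENERAL;
        }
        for (int i = 0; i < n - 1; i++) {
            char c = pattern.charAt(i);
            if (c == '%' || c == '_' || c == '\\') {
                return PatternKind.GENERAL;
            }
        }
        return PatternKind.PREFIX;
    }

    // Byte-wise prefix test on UTF-8 data. Comparing bytes is sufficient here
    // because both sides are valid UTF-8 and the value starts on a character
    // boundary.
    static boolean startsWith(byte[] bytes, int start, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (bytes[start + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Writes the indices of matching rows into sel and returns how many
    // matched. No objects are allocated inside the loop.
    static int filterPrefix(byte[][] vector, int[] start, int[] length, int n,
                            byte[] prefix, int[] sel) {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (startsWith(vector[i], start[i], length[i], prefix)) {
                sel[out++] = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String pattern = "abc%";
        byte[][] rows = {
            "abcdef".getBytes(StandardCharsets.UTF_8),
            "xabc".getBytes(StandardCharsets.UTF_8),
        };
        int[] start = {0, 0};
        int[] length = {rows[0].length, rows[1].length};
        int[] sel = new int[rows.length];

        // The decision about which path to take is made once per vector.
        if (classify(pattern) == PatternKind.PREFIX) {
            byte[] prefix = pattern.substring(0, pattern.length() - 1)
                .getBytes(StandardCharsets.UTF_8);
            int matched = filterPrefix(rows, start, length, rows.length, prefix, sel);
            System.out.println("matched rows: " + matched);
        }
    }
}
```

The %abc and %abc% cases would follow the same shape, with a suffix comparison
and a byte-wise substring search in place of startsWith.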
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira