[ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899825#comment-16899825 ]
Antoine Pitrou commented on ARROW-6131:
---------------------------------------

I expect all-ASCII data to be very frequent in the kind of data that Arrow processes (e.g. data coming from CSV files). I'm also rather wary of maintaining delicate non-trivial SIMD code in Arrow. [~wesmckinn]

> [C++] Optimize the Arrow UTF-8-string-validation
> -------------------------------------------------
>
>                 Key: ARROW-6131
>                 URL: https://issues.apache.org/jira/browse/ARROW-6131
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yuqi Gu
>            Assignee: Yuqi Gu
>            Priority: Major
>
> The new algorithm comes from: https://github.com/cyb70289/utf8 (MIT license)
> Range-based algorithm:
>   1. Map each byte of the input string to a range table.
>   2. Use the Neon 'tbl' instruction for the table lookup.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm would speed up UTF-8 validation by ~1.6x for the LargeNonAscii
> and SmallNonAscii cases, but it would slow down the all-ASCII cases
> (where the input data is an entirely ASCII string).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you please tell me what the typical use cases for Apache Arrow are?
> Is the string data that Arrow needs to validate all-ASCII?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose a separate SIMD
> optimization in another JIRA.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
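
For readers following the all-ASCII versus non-ASCII trade-off discussed in this thread, here is a minimal portable sketch of an ASCII fast path placed in front of a scalar UTF-8 validator. It is not the Neon range-table algorithm from the issue, and it is not Arrow's actual ValidateUTF8 implementation. The names IsAllAsciiSketch, ValidateUtf8ScalarSketch and ValidateUtf8Sketch are hypothetical and exist only for illustration.

```cpp
#include <cassert>   // only needed if you add assert-based checks
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not Arrow's API): true if every byte is below 0x80.
// Tests the high bit of 8 bytes at a time, then finishes byte by byte.
inline bool IsAllAsciiSketch(const std::uint8_t* data, std::size_t len) {
  std::size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    std::uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));  // alignment-safe load
    if (word & 0x8080808080808080ULL) return false;
  }
  for (; i < len; ++i) {
    if (data[i] & 0x80) return false;
  }
  return true;
}

// Hypothetical scalar validator: accepts only well-formed UTF-8, rejecting
// overlong encodings, surrogates (U+D800..U+DFFF) and values above U+10FFFF.
inline bool ValidateUtf8ScalarSketch(const std::uint8_t* data, std::size_t len) {
  std::size_t i = 0;
  while (i < len) {
    const std::uint8_t b0 = data[i];
    if (b0 < 0x80) { ++i; continue; }  // single-byte (ASCII) character

    std::size_t need = 0;              // number of continuation bytes
    std::uint8_t lo = 0x80, hi = 0xBF; // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; }
    else if (b0 == 0xE0) { need = 2; lo = 0xA0; }   // excludes overlong forms
    else if (b0 >= 0xE1 && b0 <= 0xEC) { need = 2; }
    else if (b0 == 0xED) { need = 2; hi = 0x9F; }   // excludes surrogates
    else if (b0 >= 0xEE && b0 <= 0xEF) { need = 2; }
    else if (b0 == 0xF0) { need = 3; lo = 0x90; }   // excludes overlong forms
    else if (b0 >= 0xF1 && b0 <= 0xF3) { need = 3; }
    else if (b0 == 0xF4) { need = 3; hi = 0x8F; }   // caps at U+10FFFF
    else return false;  // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence

    if (len - i <= need) return false;  // sequence truncated at end of input
    if (data[i + 1] < lo || data[i + 1] > hi) return false;
    for (std::size_t k = 2; k <= need; ++k) {
      if ((data[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
    }
    i += need + 1;
  }
  return true;
}

// Try the cheap all-ASCII check first; fall back to full validation.
inline bool ValidateUtf8Sketch(const std::string& s) {
  const std::uint8_t* data = reinterpret_cast<const std::uint8_t*>(s.data());
  if (IsAllAsciiSketch(data, s.size())) return true;
  return ValidateUtf8ScalarSketch(data, s.size());
}
```

With this layout, an all-ASCII input never reaches the multi-byte logic, which is the case the comment above expects to be common for CSV-derived data. A SIMD validator such as the one proposed in the issue would replace only the fallback path.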