[GitHub] [arrow] xhochy commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-648653322

> The R ones probably?

For these, we need to add `utf8proc` to rtools40 and rtools35 and add them to the linker line of the R build.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-648649038

The R ones probably?
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-648648745

@kou What is the problematic CI job that shows your problem? The MinGW ones seem fine.
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645401816

> Would a lookup table in the order of 256kb (generated at runtime, not in the binary) per case mapping be acceptable for Arrow?

I would find that acceptable if the mapping is only generated when needed (so you pay a one-off cost the first time a UTF8 kernel is used). I would prefer, though, that `utf8proc` implement it like this on their side. Can you open an issue there?
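
To make the "generated only if needed" idea concrete, here is a minimal sketch, not the code from this PR. It assumes the table covers only the Basic Multilingual Plane (65536 entries of 4 bytes each, which is where a figure around 256 KiB could come from) and uses a hypothetical `SlowToUpper` stand-in instead of a real `utf8proc` call so that it compiles on its own. The names `SlowToUpper`, `UpperTable`, `FastToUpper` and `kTableSize` are illustrative only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a per-codepoint case-mapping call (in a real
// build this would be the library lookup). It only knows ASCII and Latin-1.
inline uint32_t SlowToUpper(uint32_t cp) {
  if (cp >= 0x61 && cp <= 0x7A) return cp - 0x20;  // a..z
  // Latin-1 lowercase letters U+00E0..U+00FE, skipping the division sign U+00F7
  if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20;
  return cp;
}

// Assumption: one entry per BMP codepoint, 65536 * 4 bytes = 256 KiB.
constexpr uint32_t kTableSize = 0x10000;

// The table is built on the first call only. Initialization of a
// function-local static is thread-safe since C++11.
inline const std::vector<uint32_t>& UpperTable() {
  static const std::vector<uint32_t> table = [] {
    std::vector<uint32_t> t(kTableSize);
    for (uint32_t cp = 0; cp < kTableSize; ++cp) t[cp] = SlowToUpper(cp);
    return t;
  }();
  return table;
}

// Codepoints inside the table are a single array read; anything above the
// BMP falls back to the slow path.
inline uint32_t FastToUpper(uint32_t cp) {
  return cp < kTableSize ? UpperTable()[cp] : SlowToUpper(cp);
}
```

A process that never calls `FastToUpper` never allocates the table, which is the one-off cost described above. One table per case mapping (upper, lower) would be built the same way.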
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645279445

Also cross-referenced this in https://github.com/JuliaStrings/utf8proc/issues/12 to make the `utf8proc` maintainers aware of what we're doing in case they are interested.
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645253060

The major difference between `unilib` and `utf8proc` in uppercasing a character seems to be that [unilib looks up the uppercase value directly](https://github.com/ufal/unilib/blob/d8276e70b7c11c677897f71030de7258cbb1f99e/unilib/unicode.h#L81) whereas [utf8proc first gets a struct with all properties](https://github.com/JuliaStrings/utf8proc/blob/08fa0698639f15d07b12c0065a4494f2d504/utf8proc.c#L377) from which it extracts the uppercase value. Pre-computing the uppercase dictionary first could bring `utf8proc` on par with `unilib` in performance.
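
The two lookup shapes contrasted above can be sketched roughly as follows. This is a toy illustration, not code from either library: `CodepointProperty` is a hypothetical, much-reduced record, the tables cover ASCII only, and the function names are made up for the example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, much-reduced per-codepoint record; a real property struct
// carries many more fields than the one a case mapping needs.
struct CodepointProperty {
  int16_t category;
  int16_t combining_class;
  int32_t uppercase;  // -1 means "no uppercase mapping"
  int32_t lowercase;  // -1 means "no lowercase mapping"
};

constexpr uint32_t kToyRange = 128;  // ASCII only, to keep the sketch small

// Toy property table: one full record per codepoint.
inline const std::vector<CodepointProperty>& Properties() {
  static const std::vector<CodepointProperty> props = [] {
    std::vector<CodepointProperty> p(kToyRange, CodepointProperty{0, 0, -1, -1});
    for (uint32_t cp = 0x61; cp <= 0x7A; ++cp)  // a..z
      p[cp].uppercase = static_cast<int32_t>(cp - 0x20);
    for (uint32_t cp = 0x41; cp <= 0x5A; ++cp)  // A..Z
      p[cp].lowercase = static_cast<int32_t>(cp + 0x20);
    return p;
  }();
  return props;
}

// Shape A: fetch the whole record, then extract the single field of interest.
inline uint32_t UpperViaProperty(uint32_t cp) {
  if (cp >= kToyRange) return cp;
  const CodepointProperty& prop = Properties()[cp];
  return prop.uppercase >= 0 ? static_cast<uint32_t>(prop.uppercase) : cp;
}

// Shape B: a flat codepoint -> uppercase array, derived once from the records.
inline const std::vector<uint32_t>& UpperMap() {
  static const std::vector<uint32_t> map = [] {
    std::vector<uint32_t> m(kToyRange);
    for (uint32_t cp = 0; cp < kToyRange; ++cp) m[cp] = UpperViaProperty(cp);
    return m;
  }();
  return map;
}

inline uint32_t UpperDirect(uint32_t cp) {
  return cp < kToyRange ? UpperMap()[cp] : cp;
}
```

Both functions return the same mapping. Shape B reads one 4-byte entry per character instead of touching a wider record and branching on a sentinel, which is the kind of saving the pre-computed dictionary is meant to deliver. Whether it closes the gap in practice would have to be measured.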
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-644795203

> We'll need to make utf8proc a proper toolchain library, @pitrou should be able to help you with that.

I can take care of that!