augustoasilva commented on a change in pull request #11551:
URL: https://github.com/apache/arrow/pull/11551#discussion_r755009301
##########
File path: cpp/src/gandiva/gdv_function_stubs.cc
##########
@@ -794,6 +812,188 @@ const char* gdv_fn_initcap_utf8(int64_t context, const char* data, int32_t data_
*out_len = out_idx;
return out;
}
+
+GANDIVA_EXPORT
+const char* gdv_mask_first_n_utf8_int32(int64_t context, const char* data,
+ int32_t data_len, int32_t n_to_mask,
+ int32_t* out_len) {
+ if (data_len <= 0) {
+ *out_len = 0;
+ return nullptr;
+ }
+
+ *out_len = data_len;
+
+ char* out = reinterpret_cast<char*>(gdv_fn_context_arena_malloc(context, *out_len));
+ if (out == nullptr) {
+ gdv_fn_context_set_error_msg(context, "Could not allocate memory for output string");
+ *out_len = 0;
+ return nullptr;
+ }
+
+ if (n_to_mask < 0) {
+ memcpy(out, data, data_len);
+ return out;
+ }
+
+ int num_masked;
+ for (num_masked = 0; num_masked < n_to_mask; num_masked++) {
+ unsigned char char_single_byte = data[num_masked];
+ if (char_single_byte > 127) {
+ // found a multi-byte utf-8 char
+ break;
+ }
+ out[num_masked] = mask_array[char_single_byte];
+ }
+
+ utf8proc_int32_t utf8_char;
+ int char_counter = num_masked;
Review comment:
The while loop condition can't be (num_masked < n_to_mask), because
num_masked counts the bytes masked so far, not the number of characters. E.g.,
if the char 'ç' has been masked, num_masked increases by 2, since that char is
2 bytes long. char_counter is the real count of masked characters. It starts
at the value of num_masked because, until a byte greater than 127 is found,
every char is exactly 1 byte long.
But I will think of a better name for it.
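
To make the byte/character distinction concrete, here is a minimal standalone sketch. It is not the code from this PR: it does not use utf8proc, and the names Utf8SeqLen, MaskCounts and CountFirstN are hypothetical helpers made up for illustration. It only walks the first n characters of a string and reports how many bytes that covered, assuming well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Gandiva or utf8proc): byte length of the
// UTF-8 sequence that starts with `lead`. Returns 1 for ASCII and also for
// unexpected lead bytes, so the scan below always advances.
static std::size_t Utf8SeqLen(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: single-byte (ASCII)
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: 2-byte sequence
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: 3-byte sequence
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: 4-byte sequence
  return 1;
}

// Two separate counters, mirroring num_masked (bytes) and char_counter (chars).
struct MaskCounts {
  std::size_t bytes_consumed;
  std::size_t chars_masked;
};

// Walks at most `n` characters of `data`. The loop is bounded by the
// character counter, not the byte counter, and also by the input length.
MaskCounts CountFirstN(const std::string& data, std::size_t n) {
  MaskCounts c{0, 0};
  while (c.chars_masked < n && c.bytes_consumed < data.size()) {
    std::size_t len =
        Utf8SeqLen(static_cast<unsigned char>(data[c.bytes_consumed]));
    if (c.bytes_consumed + len > data.size()) break;  // truncated sequence
    c.bytes_consumed += len;  // advances by 1 to 4 bytes
    c.chars_masked += 1;      // advances by exactly one character
  }
  return c;
}
```

For the input "aç" followed by "b" with n = 2, the two counters diverge: two characters are covered, but three bytes are consumed. A loop bounded by the byte counter would have stopped one character early.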