pitrou commented on issue #44615:
URL: https://github.com/apache/arrow/issues/44615#issuecomment-2662840842

   > By the way, I have several questions regarding implementation.
   > 
   >     1. During the implementation of the extract_regex function, the writer 
was totally well-known about the StringPiece. However, the arrow array copies 
the string_view into its internal buffer.
   
   I'm not sure I understand your question, but the captured strings must 
indeed be copied into the result String array. This is required by the Arrow 
format.
   
   > 2.My question is what you expect to return for extract_regex_span? My 
suggestion is a struct which includes a fixed_size_list type for each group 
with length of two for each element and it contains an offset as the the first 
value and the length as the second value of each element. for example for the 
regex`"(?P<letter>[ab])(?P<digit>\\d)"`the struct array will be ` auto type = 
struct_({field("letter_span", fixed_size_list(int64(), 2)), field("digit_span", 
fixed_size_list(int64(), 2))});`
   
   I would suggest the following if the input is a regular string array (i.e. 
with 32-bit offsets):
   ```c++
   auto type = struct_({
     field("letter", struct_({field("start", int32()), field("length", 
int32())})),
     field("digit", struct_({field("start", int32()), field("length", 
int32())}))
   });
   ```
   and the following if the input is a large string array (i.e. with 64-bit 
offsets):
   ```c++
   auto type = struct_({
     field("letter", struct_({field("start", int64()), field("length", 
int64())})),
     field("digit", struct_({field("start", int64()), field("length", 
int64())}))
   });
   ```
   
   But I'm open to other suggestions as well. cc @jorisvandenbossche 
@zanmato1984 for ideas.
   
   >     3. I think it would be nice if I added a boolean value to 
ExtratRegexSpanOptions (the new option class for extract_regex_span) to show 
the matched string value optionally. What’s your opinion?
   
   That doesn't sound very useful IMHO. The user can define a capturing group 
for the entire regex if that's desired.
   
   >     4. There is a TODO  in 
   >            
   >              
[arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc](https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2286-L2289)
   >        It seems it is completely right, as the type can be achieved from 
`out` Should I apply the changes?
   
   Yes, you can try applying them and see if that breaks anything.
   
   > 5.What does the following TODO mean? (Since I’ve read the entire code 
related to extract_regex, maybe I could solve that.)
   > 
   > 
[arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc](https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2187-L2188)
   
   Ideally we would compile the regex once instead of at every kernel 
invocation (the `new RE2` call). However, we don't have a nice way of doing 
this for now. In any case, this should be done as a separate PR.
   
   > 6.When I ran include-what-you-use on ` 
scalar_string_ascii.cc`,`scalar_string_test.cc `,`api_scalar.cc` I gave several 
warnings. Should I pay attention to that?
   
   It depends, can you paste the warnings somewhere? 
   
   >     7. Should I consider separate test cases for extract_regex_span or 
locate its tests on the test cases of extract_regex (since tests are identical 
with a change in expected output)?
   
   It seems you can probably colocate the tests, though you can choose whatever 
is more convenient.
   
   > 8.What is the purpose of the two following tests for extract_regex? I 
mean, since in both cases the null bit map is set completely null, running the 
tests with any possible value is accepted.. I mean for [...]
   
   You're right, it seems these tests are broken, they deserve removing or 
fixing. Thanks for noticing this!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to