pitrou commented on issue #44615:
URL: https://github.com/apache/arrow/issues/44615#issuecomment-2662840842
> By the way, I have several questions regarding implementation.
>
> 1. During the implementation of the extract_regex function, the writer
was totally well-known about the StringPiece. However, the arrow array copies
the string_view into its internal buffer.
I'm not sure I understand your question, but the captured strings must
indeed be copied into the result String array. This is required by the Arrow
format.
> 2.My question is what you expect to return for extract_regex_span? My
suggestion is a struct which includes a fixed_size_list type for each group
with length of two for each element and it contains an offset as the the first
value and the length as the second value of each element. for example for the
regex`"(?P<letter>[ab])(?P<digit>\\d)"`the struct array will be ` auto type =
struct_({field("letter_span", fixed_size_list(int64(), 2)), field("digit_span",
fixed_size_list(int64(), 2))});`
I would suggest the following if the input is a regular string array (i.e.
with 32-bit offsets):
```c++
auto type = struct_({
field("letter", struct_({field("start", int32()), field("length",
int32())})),
field("digit", struct_({field("start", int32()), field("length",
int32())}))
});
```
and the following if the input is a large string array (i.e. with 64-bit
offsets):
```c++
auto type = struct_({
field("letter", struct_({field("start", int64()), field("length",
int64())})),
field("digit", struct_({field("start", int64()), field("length",
int64())}))
});
```
But I'm open to other suggestions as well. cc @jorisvandenbossche
@zanmato1984 for ideas.
> 3. I think it would be nice if I added a boolean value to
ExtratRegexSpanOptions (the new option class for extract_regex_span) to show
the matched string value optionally. What’s your opinion?
That doesn't sound very useful IMHO. The user can define a capturing group
for the entire regex if that's desired.
> 4. There is a TODO in
>
>
[arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc](https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2286-L2289)
> It seems it is completely right, as the type can be achieved from
`out` Should I apply the changes?
Yes, you can try applying them and see if that breaks anything.
> 5.What does the following TODO mean? (Since I’ve read the entire code
related to extract_regex, maybe I could solve that.)
>
>
[arrow/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc](https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2187-L2188)
Ideally we would compile the regex once instead of at every kernel
invocation (the `new RE2` call). However, we don't have a nice way of doing
this for now. In any case, this should be done as a separate PR.
> 6.When I ran include-what-you-use on `
scalar_string_ascii.cc`,`scalar_string_test.cc `,`api_scalar.cc` I gave several
warnings. Should I pay attention to that?
It depends, can you paste the warnings somewhere?
> 7. Should I consider separate test cases for extract_regex_span or
locate its tests on the test cases of extract_regex (since tests are identical
with a change in expected output)?
It seems you can probably colocate the tests, though you can choose whatever
is more convenient.
> 8.What is the purpose of the two following tests for extract_regex? I
mean, since in both cases the null bit map is set completely null, running the
tests with any possible value is accepted.. I mean for [...]
You're right, it seems these tests are broken, they deserve removing or
fixing. Thanks for noticing this!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]