arashandishgar commented on issue #44615:
URL: https://github.com/apache/arrow/issues/44615#issuecomment-2662949381
> > 2. My question is: what do you expect extract_regex_span to return? My
suggestion is a struct with one fixed_size_list field per capture group. Each
list has length two: the first value is the offset and the second value is the
length. For example, for the regex `"(?P<letter>[ab])(?P<digit>\\d)"` the type
of the struct array would be `auto type = struct_({field("letter_span",
fixed_size_list(int64(), 2)), field("digit_span", fixed_size_list(int64(),
2))});`
>
> I would suggest the following if the input is a regular string array (i.e.
with 32-bit offsets):
>
> auto type = struct_({
>   field("letter", struct_({field("start", int32()), field("length", int32())})),
>   field("digit", struct_({field("start", int32()), field("length", int32())}))
> });
> and the following if the input is a large string array (i.e. with 64-bit
offsets):
>
> auto type = struct_({
>   field("letter", struct_({field("start", int64()), field("length", int64())})),
>   field("digit", struct_({field("start", int64()), field("length", int64())}))
> });
> But I'm open to other suggestions as well. cc
[@jorisvandenbossche](https://github.com/jorisvandenbossche)
[@zanmato1984](https://github.com/zanmato1984) for ideas.

I think it would be better to store the start and length values next to each
other in memory (for example, with a fixed_size_list) rather than in two
separate buffers, since both are needed to interpret the result and neither is
meaningful on its own. What do you think?
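
To make the layout difference concrete, here is a minimal standalone sketch in plain C++ (it does not use Arrow). `InterleavedSpans` mimics the single values buffer of a `fixed_size_list` of size 2, and `SplitSpans` mimics the two child buffers of a `struct_` with `start` and `length` fields. All type and function names are hypothetical. Two assumptions: `std::regex` has no named groups, so the pattern uses plain numbered groups, and `-1` stands in for "no match", where a real kernel would presumably emit a null.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <regex>
#include <string>
#include <vector>

// Hypothetical illustration types; none of these are Arrow APIs.

// Mimics fixed_size_list(int32(), 2): one buffer holding
// [start0, length0, start1, length1, ...].
struct InterleavedSpans {
  std::vector<int32_t> values;

  int32_t start(std::size_t row) const {
    assert(2 * row + 1 < values.size());
    return values[2 * row];
  }
  int32_t length(std::size_t row) const {
    assert(2 * row + 1 < values.size());
    return values[2 * row + 1];
  }
};

// Mimics struct_({field("start", int32()), field("length", int32())}):
// two independent child buffers.
struct SplitSpans {
  std::vector<int32_t> start;
  std::vector<int32_t> length;
};

struct GroupSpans {
  InterleavedSpans interleaved;
  SplitSpans split;
};

// Records the byte offset and length of capture group `group` for each
// input string, in both layouts. Rows without a match get -1 for both
// values (an assumption made for this sketch only).
GroupSpans ExtractSpans(const std::vector<std::string>& inputs,
                        const std::regex& re, std::size_t group) {
  GroupSpans out;
  for (const std::string& s : inputs) {
    int32_t start = -1;
    int32_t length = -1;
    std::smatch m;
    if (std::regex_search(s, m, re) && m[group].matched) {
      start = static_cast<int32_t>(m.position(group));
      length = static_cast<int32_t>(m.length(group));
    }
    // Layout A: both values land next to each other in one buffer.
    out.interleaved.values.push_back(start);
    out.interleaved.values.push_back(length);
    // Layout B: each value goes to its own buffer.
    out.split.start.push_back(start);
    out.split.length.push_back(length);
  }
  return out;
}
```

Calling `ExtractSpans(inputs, std::regex("([ab])(\\d)"), 1)` gives the spans of the letter group and passing `2` gives the digit group. Reading one span from the interleaved layout touches two adjacent values, while the split layout reads one value from each of two buffers. On the other hand, the split layout gives each value a field name, which the interleaved layout lacks.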