Re: [I] [C++] Add possibility to extract spans/byte offsets directly for `compute.extract_regex` [arrow]

via GitHub Mon, 17 Feb 2025 05:03:12 -0800


arashandishgar commented on issue #44615:
URL: https://github.com/apache/arrow/issues/44615#issuecomment-2662949381


   
   > > 2. My question is what you expect `extract_regex_span` to return. My 
suggestion is a struct that holds one fixed_size_list field of length two per 
capture group, with the offset as the first value and the length as the second 
value of each element. For example, for the regex 
`"(?P<letter>[ab])(?P<digit>\\d)"` the struct array type would be `auto type = 
struct_({field("letter_span", fixed_size_list(int64(), 2)), field("digit_span", 
fixed_size_list(int64(), 2))});`
   > 
   > I would suggest the following if the input is a regular string array (i.e. 
with 32-bit offsets):
   > 
   > auto type = struct_({
   >   field("letter", struct_({field("start", int32()), field("length", int32())})),
   >   field("digit", struct_({field("start", int32()), field("length", int32())}))
   > });
   > and the following if the input is a large string array (i.e. with 64-bit 
offsets):
   > 
   > auto type = struct_({
   >   field("letter", struct_({field("start", int64()), field("length", int64())})),
   >   field("digit", struct_({field("start", int64()), field("length", int64())}))
   > });
   > But I'm open to other suggestions as well. cc 
[@jorisvandenbossche](https://github.com/jorisvandenbossche) 
[@zanmato1984](https://github.com/zanmato1984) for ideas.
   
   I think it is better to store the start and length values next to each 
other in memory (for example with a fixed_size_list) rather than in two 
separate memory buffers: both are needed to reconstruct a span, and neither is 
meaningful on its own. What is your opinion?
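
To make the comparison concrete, here is a rough, library-free sketch of the two layouts. It deliberately does not use Arrow: `SpanColumns`, `SpanInterleaved`, `ExtractSpanColumns` and `Interleave` are hypothetical names invented for illustration, `std::regex` stands in for the real regex engine, and the pattern uses unnamed groups because `std::regex` does not accept the `(?P<name>...)` syntax. Reporting non-matching inputs as (-1, -1) is also just an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <regex>
#include <string>
#include <vector>

// Layout A, analogous to struct<start: int32, length: int32>:
// each field lives in its own buffer.
struct SpanColumns {
  std::vector<std::int32_t> start;
  std::vector<std::int32_t> length;
};

// Layout B, analogous to fixed_size_list<int32>[2]:
// one buffer holding [start0, length0, start1, length1, ...].
using SpanInterleaved = std::vector<std::int32_t>;

// Hypothetical helper (not an Arrow API): byte span of capture group `group`
// in the first match of `re` within each input. Inputs without a match get
// (-1, -1) here; a real kernel would presumably emit a null instead.
SpanColumns ExtractSpanColumns(const std::vector<std::string>& inputs,
                               const std::regex& re, std::size_t group) {
  SpanColumns out;
  for (const std::string& s : inputs) {
    std::smatch m;
    if (std::regex_search(s, m, re) && m[group].matched) {
      out.start.push_back(static_cast<std::int32_t>(m.position(group)));
      out.length.push_back(static_cast<std::int32_t>(m.length(group)));
    } else {
      out.start.push_back(-1);
      out.length.push_back(-1);
    }
  }
  return out;
}

// Same information, interleaved so the start and length of a span are adjacent.
SpanInterleaved Interleave(const SpanColumns& cols) {
  assert(cols.start.size() == cols.length.size());
  SpanInterleaved out;
  out.reserve(cols.start.size() * 2);
  for (std::size_t i = 0; i < cols.start.size(); ++i) {
    out.push_back(cols.start[i]);
    out.push_back(cols.length[i]);
  }
  return out;
}
```

With the inputs `{"a1", "xxb2"}`, the pattern `([ab])(\d)` and group 1, layout A holds `start = [0, 2]` and `length = [1, 1]` in two buffers, while layout B holds `[0, 1, 2, 1]` in one. Both carry the same information; the difference is only whether reading one span touches one buffer or two.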

