arashandishgar commented on issue #44615:
URL: https://github.com/apache/arrow/issues/44615#issuecomment-2658560551

   I’ve already evaluated the issue, and I’m able to implement the feature 
(this is my first contribution). Could you assign it to me? @pitrou 
   
   By the way, I have several questions regarding implementation.
   
   1. During the implementation of the extract_regex function, the writer was 
totally well-known about the StringPiece. However, the arrow array copies the 
string_view into its internal buffer.
   
https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2308
   
   2.My question is what you expect to return for extract_regex_span? My 
suggestion is a struct which includes a fixed_size_list type for each group 
with length of two for each element  and it contains an offset as the the first 
value and the length as the second value of each element. for example for the 
regex` "(?P<letter>[ab])(?P<digit>\\d)" `the struct array will be
   ` auto type = struct_({field("letter_span", fixed_size_list(int64(), 2)),
                          field("digit_span", fixed_size_list(int64(), 2))});`
   
   3. I think it would be nice if I added a boolean value to 
ExtratRegexSpanOptions (the new option class for extract_regex_span) to show 
the matched string value optionally. What’s your opinion?
   
   4. There is a TODO  in 
https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2286-L2289
   It seems it is completely right, as the type can be achieved from. outShould 
I apply the changes?
   
   5.What does the following TODO mean?  (Since I’ve read the entire code 
related to extract_regex, maybe I could solve that.)
   
https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2187-L2188
   
   6.When I ran include-what-you-use on `  scalar_string_ascii.cc 
`,`scalar_string_test.cc `,`api_scalar.cc` I gave several warnings. Should I 
pay attention to that?
   
   7. Should I consider separate test cases for extract_regex_span or locate 
its tests on the test cases of extract_regex (since tests are identical with a 
change in expected output)?
   
   8.What is the purpose of the two following tests for extract_regex? I mean, 
since in both cases the null bit map is set completely null, running the tests 
with any possible value is accepted.. I mean for 
https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L317-L326
   the following value is accepted
   `auto output = StructArray::Make(
           {this->MakeArray({"\xfc 2", "3"}), this->MakeArray({"\xfc w", "3"})},
           {field("letter", this->type()), field("digit", this->type())}, 
null_bitmap);`
   
   and for the following test
   
   
https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L373-L384
   The following value is accepted
   
   `auto output = StructArray::Make(
           {this->template MakeArray<std::string>({{"\x00", 1}, {"1", 1}}),
            this->template MakeArray<std::string>({{"\x00 2", 3}, {"2 2", 3}})},
           {field("null", this->type()), field("digit", this->type())}, 
null_bitmap);`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to