arashandishgar commented on issue #44615: URL: https://github.com/apache/arrow/issues/44615#issuecomment-2658560551
I’ve already evaluated the issue, and I’m able to implement the feature (this is my first contribution). Could you assign it to me? @pitrou By the way, I have several questions regarding implementation. 1. During the implementation of the extract_regex function, the writer was totally well-known about the StringPiece. However, the arrow array copies the string_view into its internal buffer. https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2308 2.My question is what you expect to return for extract_regex_span? My suggestion is a struct which includes a fixed_size_list type for each group with length of two for each element and it contains an offset as the the first value and the length as the second value of each element. for example for the regex` "(?P<letter>[ab])(?P<digit>\\d)" `the struct array will be ` auto type = struct_({field("letter_span", fixed_size_list(int64(), 2)), field("digit_span", fixed_size_list(int64(), 2))});` 3. I think it would be nice if I added a boolean value to ExtratRegexSpanOptions (the new option class for extract_regex_span) to show the matched string value optionally. What’s your opinion? 4. There is a TODO in https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2286-L2289 It seems it is completely right, as the type can be achieved from. outShould I apply the changes? 5.What does the following TODO mean? (Since I’ve read the entire code related to extract_regex, maybe I could solve that.) https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2187-L2188 6.When I ran include-what-you-use on ` scalar_string_ascii.cc `,`scalar_string_test.cc `,`api_scalar.cc` I gave several warnings. Should I pay attention to that? 7. Should I consider separate test cases for extract_regex_span or locate its tests on the test cases of extract_regex (since tests are identical with a change in expected output)? 8.What is the purpose of the two following tests for extract_regex? I mean, since in both cases the null bit map is set completely null, running the tests with any possible value is accepted.. I mean for https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L317-L326 the following value is accepted `auto output = StructArray::Make( {this->MakeArray({"\xfc 2", "3"}), this->MakeArray({"\xfc w", "3"})}, {field("letter", this->type()), field("digit", this->type())}, null_bitmap);` and for the following test https://github.com/apache/arrow/blob/a931affa1766be1f98979228ca6ab124aa7f793c/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L373-L384 The following value is accepted `auto output = StructArray::Make( {this->template MakeArray<std::string>({{"\x00", 1}, {"1", 1}}), this->template MakeArray<std::string>({{"\x00 2", 3}, {"2 2", 3}})}, {field("null", this->type()), field("digit", this->type())}, null_bitmap);` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
