github-actions[bot] commented on PR #61156:
URL: https://github.com/apache/doris/pull/61156#issuecomment-4115821150

   ## Code Review Summary — PR #61156: [Feature](func) Support 
REGEXP_EXTRACT_ALL_ARRAY
   
   ### Overview
   This PR adds a new function `regexp_extract_all_array` that returns 
`Array<String>` type instead of the string-formatted array `['a','b']` returned 
by `regexp_extract_all`. The implementation is well-structured, using a 
strategy/handler pattern to share the core regex matching logic between the two 
output formats. The refactoring also consolidates 
`RegexpReplaceImpl`/`RegexpReplaceOneImpl` into a single template and lifts 
`execute_impl` logic from `FunctionRegexpFunctionality` into the `Impl` structs.
   
   ### Critical Checkpoint Conclusions
   
   1. **Goal & correctness**: The goal is to add `regexp_extract_all_array` 
returning `Array<Nullable<String>>`. The code accomplishes this. Both BE unit 
tests and regression tests cover the new function. The refactoring of existing 
functions (`regexp_replace`, `regexp_replace_one`, `regexp_extract`, 
`regexp_extract_or_null`) is a reasonable consolidation. Tests exist for the 
new function.
   
   2. **Modification scope**: The PR is focused on the new function plus a 
clean refactoring of adjacent regexp functions. The refactoring is larger than 
strictly necessary for the feature but improves code quality by eliminating 
duplication. Acceptable scope.
   
   3. **Concurrency**: No concurrency concerns. Function execution is 
single-threaded per fragment instance. The `THREAD_LOCAL` function state 
pattern is correctly followed.
   
   4. **Lifecycle management**: No special lifecycle concerns. The 
`RegexpExtractEngine` is either pre-compiled in `open()` and stored via 
`set_function_state`, or created per-row in the inner loop with `unique_ptr`. 
Both patterns are correct.
   
   5. **Configuration items**: No new configuration items added. N/A.
   
   6. **Incompatible changes / rolling upgrades**: No incompatible changes. The 
new function is additive. Existing `regexp_extract_all` behavior is preserved 
(verified by existing tests in the same file).
   
   7. **Parallel code paths**: The handler pattern 
(`RegexpExtractAllStringOutput` / `RegexpExtractAllArrayOutput`) correctly 
shares the core `regexp_extract_all_inner_loop`. Both handlers are exercised. 
The null_map check removal from `RegexpReplaceImpl::execute_impl` is correct — 
the null_map is initialized to all-zeros and the IFunction framework handles 
nullable column unwrapping.
   
   8. **Special conditional checks**: N/A.
   
   9. **Test coverage**: Good coverage.
      - BE unit tests: cover basic matching, null input propagation (null 
string, null pattern), invalid pattern handling, no-match case, empty result 
(no capturing groups).
      - Regression tests: 11 test cases covering constant patterns, column 
patterns, non-constant patterns (`concat('^', k)`), NULL inputs, empty match, 
joins with other columns. Tests use `qt_` prefix with ORDER BY for 
deterministic ordering.
      - Missing: no test for `regexp_extract_all_array` with 
`enable_extended_regex=true` (boost regex path). The existing 
`regexp_extract_all` has such tests but the array variant does not. This is 
minor since the matching engine is shared.
   
   10. **Observability**: N/A for a new scalar function.
   
   11. **Transaction / persistence**: N/A.
   
   12. **Data writes**: N/A.
   
   13. **FE-BE variable passing**: The new function is registered in 
`BuiltinScalarFunctions.java`, `RegexpExtractAllArray.java` and 
`ScalarFunctionVisitor.java`. The FE signature correctly maps to the BE 
function name `regexp_extract_all_array`. No new thrift variables needed.
   
   14. **Performance**: 
       - The `RegexpExtractAllArrayOutput::push_matches` uses `insert_data` per 
match element, which is straightforward. For very large result sets this could 
be optimized with batch insertion, but for typical regex match counts this is 
fine.
       - The `RegexpExtractAllStringOutput::push_matches` uses a pre-calculated 
buffer with manual serialization, which is an improvement over the old code 
that used string concatenation (`res += "'" + res_matches[j] + "'"`).
       - The `std::visit` + `make_bool_variant` pattern for const/non-const 
dispatch is consistent with the rest of the codebase.
   
   15. **Other observations**: No issues found. The code follows existing 
patterns in the codebase, the FE class mirrors `RegexpExtractAll.java` closely, 
and the regression tests follow the required conventions (table dropped before 
use, `qt_` prefix with ORDER BY).
   
   ### Verdict
   **No blocking issues found.** The PR is well-structured with good test 
coverage and clean refactoring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to