github-actions[bot] commented on PR #61156:
URL: https://github.com/apache/doris/pull/61156#issuecomment-4115821150
## Code Review Summary — PR #61156: [Feature](func) Support
REGEXP_EXTRACT_ALL_ARRAY
### Overview
This PR adds a new function `regexp_extract_all_array` that returns
`Array<String>` type instead of the string-formatted array `['a','b']` returned
by `regexp_extract_all`. The implementation is well-structured, using a
strategy/handler pattern to share the core regex matching logic between the two
output formats. The refactoring also consolidates
`RegexpReplaceImpl`/`RegexpReplaceOneImpl` into a single template and lifts
`execute_impl` logic from `FunctionRegexpFunctionality` into the `Impl` structs.
### Critical Checkpoint Conclusions
1. **Goal & correctness**: The goal is to add `regexp_extract_all_array`
returning `Array<Nullable<String>>`. The code accomplishes this. Both BE unit
tests and regression tests cover the new function. The refactoring of existing
functions (`regexp_replace`, `regexp_replace_one`, `regexp_extract`,
`regexp_extract_or_null`) is a reasonable consolidation. Tests exist for the
new function.
2. **Modification scope**: The PR is focused on the new function plus a
clean refactoring of adjacent regexp functions. The refactoring is larger than
strictly necessary for the feature but improves code quality by eliminating
duplication. Acceptable scope.
3. **Concurrency**: No concurrency concerns. Function execution is
single-threaded per fragment instance. The `THREAD_LOCAL` function state
pattern is correctly followed.
4. **Lifecycle management**: No special lifecycle concerns. The
`RegexpExtractEngine` is either pre-compiled in `open()` and stored via
`set_function_state`, or created per-row in the inner loop with `unique_ptr`.
Both patterns are correct.
5. **Configuration items**: No new configuration items added. N/A.
6. **Incompatible changes / rolling upgrades**: No incompatible changes. The
new function is additive. Existing `regexp_extract_all` behavior is preserved
(verified by existing tests in the same file).
7. **Parallel code paths**: The handler pattern
(`RegexpExtractAllStringOutput` / `RegexpExtractAllArrayOutput`) correctly
shares the core `regexp_extract_all_inner_loop`. Both handlers are exercised.
The null_map check removal from `RegexpReplaceImpl::execute_impl` is correct —
the null_map is initialized to all-zeros and the IFunction framework handles
nullable column unwrapping.
8. **Special conditional checks**: N/A.
9. **Test coverage**: Good coverage.
- BE unit tests: cover basic matching, null input propagation (null
string, null pattern), invalid pattern handling, no-match case, empty result
(no capturing groups).
- Regression tests: 11 test cases covering constant patterns, column
patterns, non-constant patterns (`concat('^', k)`), NULL inputs, empty match,
joins with other columns. Tests use `qt_` prefix with ORDER BY for
deterministic ordering.
- Missing: no test for `regexp_extract_all_array` with
`enable_extended_regex=true` (boost regex path). The existing
`regexp_extract_all` has such tests but the array variant does not. This is
minor since the matching engine is shared.
10. **Observability**: N/A for a new scalar function.
11. **Transaction / persistence**: N/A.
12. **Data writes**: N/A.
13. **FE-BE variable passing**: The new function is registered in
`BuiltinScalarFunctions.java`, `RegexpExtractAllArray.java` and
`ScalarFunctionVisitor.java`. The FE signature correctly maps to the BE
function name `regexp_extract_all_array`. No new thrift variables needed.
14. **Performance**:
- The `RegexpExtractAllArrayOutput::push_matches` uses `insert_data` per
match element, which is straightforward. For very large result sets this could
be optimized with batch insertion, but for typical regex match counts this is
fine.
- The `RegexpExtractAllStringOutput::push_matches` uses a pre-calculated
buffer with manual serialization, which is an improvement over the old code
that used string concatenation (`res += "'" + res_matches[j] + "'"`).
- The `std::visit` + `make_bool_variant` pattern for const/non-const
dispatch is consistent with the rest of the codebase.
15. **Other observations**: No issues found. The code follows existing
patterns in the codebase, the FE class mirrors `RegexpExtractAll.java` closely,
and the regression tests follow the required conventions (table dropped before
use, `qt_` prefix with ORDER BY).
### Verdict
**No blocking issues found.** The PR is well-structured with good test
coverage and clean refactoring.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]