maartenbreddels opened a new pull request #8271:
URL: https://github.com/apache/arrow/pull/8271
Contains:
* split_pattern kernel with max_split and reverse option
* split_whitespace_ascii similar to Python's bytes.split
* split_whitespace_utf8 similar to Python's str.split
It should be easy to add new split methods, e.g. a regex one in the future.
This PR mostly done, except for:
* Expose the options/kernels to Python, like done in #7593
* There is 1 test commented out, because it fails for unknown reasons:
```c++
this->CheckUnary("split_pattern", R"(["foo bar", "foo", null])",
this->list_type(),
R"([["foo", "bar"], ["foo"], null])", &options);
```
Fails with the following output:
```
Failed:
Got:
[
[
[
"foo",
"bar"
]
],
[
[
"foo"
],
null
]
]
Expected:
[
[
[
"foo",
"bar"
]
],
[
[
"foo"
],
null
]
]
```
It seems the test framework does not consider these equal, possibly due to
the null value.
This also raised a question: Should both the list entry *and* the string
array entry have a missing/null value if the input string contains a null
value? I think it should because if we ask for the underlying string array, and
the string value that the missing list entry points to is not a missing value,
it will be an empty string, which seems odd to me.
In the code, I've commented out the null_builder, instead of removing it,
for reviewing, enabling it also does not pass the unit test (that was commented
out).
Similar to `match_substring`, the kernel `split_pattern` could also be
executed on BinaryArray's, but I'm not sure what the proper way is to make this
work. Instantiating new templates would increase the binary size, so maybe
there is a smarter way to do this. And possibly this should be done in a
different PR, together with `match_substring`.
cc @TomAugspurger
Related #8231
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]