maartenbreddels opened a new pull request #8271:
URL: https://github.com/apache/arrow/pull/8271


   Contains:
    * split_pattern kernel with max_split and reverse option
    * split_whitespace_ascii similar to Python's bytes.split
    * split_whitespace_utf8 similar to Python's str.split
   
   It should be easy to add new split methods, e.g. a regex one in the future.
   
   This PR mostly done, except for:
    * Expose the options/kernels to Python, like done in #7593 
    * There is 1 test commented out, because it fails for unknown reasons:
   ```c++
     this->CheckUnary("split_pattern", R"(["foo bar", "foo", null])", 
this->list_type(),
                      R"([["foo", "bar"], ["foo"], null])", &options);
   ```
   Fails with the following output:
   ```
   Failed:
   Got: 
     [
       [
         [
           "foo",
           "bar"
         ]
       ],
       [
         [
           "foo"
         ],
         null
       ]
     ]
   Expected: 
     [
       [
         [
           "foo",
           "bar"
         ]
       ],
       [
         [
           "foo"
         ],
         null
       ]
     ]
   ```
   
   It seems the test framework does not consider these equal, possibly due to 
the null value.
   
   This also raised a question: Should both the list entry *and* the string 
array entry have a missing/null value if the input string contains a null 
value? I think it should because if we ask for the underlying string array, and 
the string value that the missing list entry points to is not a missing value, 
it will be an empty string, which seems odd to me.
   
   In the code, I've commented out the null_builder, instead of removing it, 
for reviewing, enabling it also does not pass the unit test (that was commented 
out).
   
   Similar to `match_substring`, the kernel `split_pattern` could also be 
executed on BinaryArray's, but I'm not sure what the proper way is to make this 
work. Instantiating new templates would increase the binary size, so maybe 
there is a smarter way to do this. And possibly this should be done in a 
different PR, together with `match_substring`.
   
   cc @TomAugspurger 
   
   Related #8231 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to