[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377194#comment-17377194 ]
Eduardo Ponce commented on ARROW-13259:
---------------------------------------

[In C++, {{SliceOptions}} sets the {{stop}} option to {{std::numeric_limits<int64_t>::max()}} by default|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h#L205-L206]. Therefore, to slice to the end of a string, simply omit {{stop}} or set it to any value >= len(string).

{code:c++}
// start=-5, stop=std::numeric_limits<int64_t>::max(), step=1
SliceOptions opts(-5);
auto result = CallFunction("utf8_slice_codeunits", {Datum("Apache Arrow")}, &opts);
if (result.ok()) {
  Datum slice = std::move(result).ValueOrDie();
  // Prints "Arrow"
  std::cout << slice.scalar()->ToString() << std::endl;
} else {
  ARROW_LOG(ERROR) << result.status();
}
{code}

In R you should be able to do the following:

{code:r}
# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L))
[1] "Arrow"
{code}

[~jorisvandenbossche] The issue in PyArrow arises because the [interface for {{SliceOptions}} does not set a default value for the {{stop}} option (only for the {{step}} option)|https://github.com/apache/arrow/blob/master/python/pyarrow/_compute.pyx#L798]. Therefore, both {{start}} and {{stop}} are required arguments.
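
As an aside, the reason a very large {{stop}} is safe can be sketched with Python's built-in string slicing, which clamps an out-of-range {{stop}} to the string length and gives the same results as the cases quoted in this thread. This is only an analogy: {{slice_codeunits}} below is a hypothetical stand-in written for illustration, it does not call Arrow, and it is only checked here against an ASCII string.

```python
import sys


def slice_codeunits(text, start, stop=sys.maxsize, step=1):
    """Hypothetical stand-in for the kernel, built on Python's own slicing.

    Assumption: for the ASCII cases in this thread, Python slicing behaves
    the same way as "utf8_slice_codeunits" (stop is exclusive, negative
    indices count from the end, out-of-range stop is clamped).
    """
    return text[start:stop:step]


s = "Apache Arrow"

# stop=-1 is exclusive, so the last character is dropped.
print(repr(slice_codeunits(s, -5, -1)))

# stop=0 points at the start of the string, so nothing is selected.
print(repr(slice_codeunits(s, -5, 0)))

# An explicit length and the sys.maxsize default select the same tail.
print(repr(slice_codeunits(s, -5, len(s))))
print(repr(slice_codeunits(s, -5)))
```

With a default like this, callers that do not know the string length (or that slice strings of different lengths) never need to compute {{stop}} themselves.
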
{code:python}
>>> import pyarrow.compute as pc
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5, stop=len(string))
<pyarrow.StringScalar: 'Arrow'>
{code}

[By providing {{sys.maxsize}} as the default for the {{stop}} option|https://github.com/edponce/arrow/blob/ARROW-13259-Enable-slicing-to-end-of-string-using-ut/python/pyarrow/_compute.pyx#L800-L802], we can do the following:

{code:python}
>>> import pyarrow.compute as pc
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5)
<pyarrow.StringScalar: 'Arrow'>
{code}

The question that naturally follows from this JIRA is: *Are all the default options in the PyArrow and R bindings consistent with the C++ defaults?*

> [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13259
>                 URL: https://issues.apache.org/jira/browse/ARROW-13259
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nic Crane
>            Priority: Major
>
> We're currently trying to write bindings from the C++ function "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour of R's stringr::str_sub.
> In both the R and C++ implementations, I can use negative indices to count back from the end of a string (shown below in R, but the latter directly invokes the C++ implementation):
>
> {code:java}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -2)
> [1] "Arro"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=-1L))
> Scalar
> Arro
> {code}
> Note that in the C++ implementation, I have to add 1 to the stop value as the final value is non-inclusive.
> The problem is when I'm trying to use negative indices to refer to the final values in a string:
>
> {code:java}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -1)
> [1] "Arrow"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=0L))
> Scalar
>
> {code}
> The result is blank as the 'stop' value 0 refers to the start of the string, effectively walking backwards, which isn't possible (except via the step argument, which I can't get working but I don't think is what I want anyway).
> I've tried to get around this by attempting to write some code that calculates the length of the string and supplies that to the stop argument, but it didn't work.
> I do have a possible workaround that involves reversing the string, extracting the substring using inverted values of swapped stop/start values, and then reversing the result, but before I go down that path, I was wondering if there is anything that can (and should! the answer may be a simple "nope!") be changed in the C++ code to make it possible to do this a different way?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)