[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377248#comment-17377248 ]

Nic Crane commented on ARROW-13259:
---

[~edponce] Thanks for that clarification, I'd totally missed that!

[~pachamaltese] - I totally missed this in my initial review of the code, but the thing that actually needs changing is the bindings in `compute.cpp`. There, start and stop have been set to 1 and -1 respectively, but they need to reflect the default values from here: [https://github.com/apache/arrow/blob/7eea2f53a1002552bbb87db5611e75c15b88b504/cpp/src/arrow/compute/api_scalar.h#L203-L210]

I think that the `step` argument also needs implementing.

We really should write this up (I can add it to my to-do list!) as it's neither obvious nor trivial to work out the various steps required here.

> [C++] Enable slicing to end of string using "utf8_slice_codeunits" when
> string length unknown or different lengths
> ---
>
> Key: ARROW-13259
> URL: https://issues.apache.org/jira/browse/ARROW-13259
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nic Crane
> Priority: Major
>
> We're currently trying to write bindings from the C++ function
> "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour
> of R's stringr::str_sub.
> In both the R and C++ implementations, I can use negative indices to count
> back from the end of a string (shown below in R, but the latter directly
> invokes the C++ implementation):
>
> {code:r}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -2)
> [1] "Arro"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=-1L))
> Scalar
> Arro
> {code}
> Note that in the C++ implementation, I have to add 1 to the stop value as the
> final value is non-inclusive.
> The problem is when I'm trying to use negative indices to refer to the final
> values in a string:
>
> {code:r}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -1)
> [1] "Arrow"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=0L))
> Scalar
>
> {code}
> The result is blank as the 'stop' value 0 refers to the start of the string,
> effectively walking backwards, which isn't possible (except via the step
> argument, which I can't get working but I don't think is what I want anyway).
> I've tried to get around this by attempting to write some code that
> calculates the length of the string and supplies that to the stop argument, but
> it didn't work.
> I do have a possible workaround that involves reversing the string,
> extracting the substring using inverted values of swapped stop/start values,
> and then reversing the result, but before I go down that path, I was
> wondering if there is anything that can (and should! the answer may be a
> simple "nope!") be changed in the C++ code to make it possible to do this a
> different way?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
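The empty result described in the issue can be reproduced without Arrow. The sketch below uses plain Python slicing as a stand-in for the kernel's half-open start/stop semantics; that equivalence is an assumption made for illustration, and `naive_stop` is a hypothetical helper name, not part of any Arrow binding.

```python
def naive_stop(end: int) -> int:
    """Turn an inclusive end index into an exclusive stop by adding 1."""
    return end + 1


text = "Apache Arrow"

# end = -2 maps to stop = -1, which still counts back from the end.
middle = text[-5:naive_stop(-2)]

# end = -1 maps to stop = 0, which now means "start of string",
# so the slice is empty instead of reaching the last character.
tail = text[-5:naive_stop(-1)]

print(repr(middle), repr(tail))
```

The "add 1" rule only breaks at the boundary where the result crosses from negative to zero, which is why the other negative end values appear to work.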
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377220#comment-17377220 ]

Eduardo Ponce commented on ARROW-13259:
---

Created [ARROW-13288|https://issues.apache.org/jira/browse/ARROW-13288] to verify inconsistencies between the C++ and Python kernel options.
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377194#comment-17377194 ]

Eduardo Ponce commented on ARROW-13259:
---

[In C++, {{SliceOptions}} by default has the {{stop}} option set to {{std::numeric_limits<int64_t>::max()}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h#L205-L206]. Therefore, if you want to slice to the end of a string, simply omit a value for {{stop}} or set it to a value >= len(string).

{code:c++}
// start=-5, stop=std::numeric_limits<int64_t>::max(), step=1
SliceOptions opts(-5);
auto result = CallFunction("utf8_slice_codeunits", {Datum("Apache Arrow")}, &opts);
if (result.ok()) {
  Datum slice = std::move(result).ValueOrDie();
  // Prints "Arrow"
  std::cout << slice.scalar()->ToString() << std::endl;
} else {
  ARROW_LOG(ERROR) << result.status();
}
{code}

In R you should be able to do the following:

{code:r}
# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L))
[1] "Arrow"
{code}

[~jorisvandenbossche] The issue in PyArrow arises because the [interface for {{SliceOptions}} does not set a default value for the {{stop}} option (only for the {{step}} option)|https://github.com/apache/arrow/blob/master/python/pyarrow/_compute.pyx#L798]. Therefore, {{start}} and {{stop}} are both required arguments.

{code:python}
>>> import pyarrow.compute as pc
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5, stop=len(string))
{code}

[By providing {{sys.maxsize}} as the default {{stop}} option|https://github.com/edponce/arrow/blob/ARROW-13259-Enable-slicing-to-end-of-string-using-ut/python/pyarrow/_compute.pyx#L800-L802], we can do the following:

{code:python}
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5)
{code}

The question that naturally follows from this JIRA is: *Are all the default options in the PyArrow and R bindings consistent with the C++ defaults?*
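To make the point about defaults concrete, here is a minimal sketch of an options object whose `stop` defaults to a "slice to the end" value. `SliceOptionsSketch` is a hypothetical stand-in written for this illustration (it is not the PyArrow or C++ class), and plain Python slicing is assumed to approximate the kernel's behaviour.

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceOptionsSketch:
    """Hypothetical stand-in for SliceOptions; not the real binding."""

    start: int
    stop: int = sys.maxsize  # assumed default meaning "up to the end"
    step: int = 1

    def apply(self, text: str) -> str:
        # Python slicing clamps an oversized stop to len(text).
        return text[self.start:self.stop:self.step]


only_start = SliceOptionsSketch(start=-5).apply("Apache Arrow")
explicit_stop = SliceOptionsSketch(start=-5, stop=-1).apply("Apache Arrow")
print(only_start, explicit_stop)
```

With a default in place, callers who only care about the tail of a string never have to know its length; a binding that hard-codes a different default (or none) silently changes that contract.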
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374939#comment-17374939 ]

Mauricio 'Pachá' Vargas Sepúlveda commented on ARROW-13259:
---

thanks a lot, I've edited my PR since I'm on 21.04, I'm considering doing my work on a virtual machine until the build is fixed
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374924#comment-17374924 ]

Nic Crane commented on ARROW-13259:
---

Thanks very much [~maartenbreddels] and [~jorisvandenbossche]!

[~lidavidm] - nah, it's fine, I can just copy from the Python implementation and chuck in some R code like

{code:r}
if (stop == -1) stop <- .Machine$integer.max
{code}

CC [~pachamaltese]
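The special case suggested in the comment above (treat an end of -1 as "largest integer") can be sketched as a full index translation. This is an illustration in Python rather than the R binding itself; `str_sub_bounds`, `str_sub` and `END_OF_STRING` are hypothetical names, and Python slicing is assumed to behave like the kernel's half-open start/stop.

```python
import sys
from typing import Tuple

END_OF_STRING = sys.maxsize  # hypothetical sentinel for "no upper bound"


def str_sub_bounds(start: int, end: int) -> Tuple[int, int]:
    """Map 1-based inclusive (stringr-style) indices to 0-based half-open ones."""
    # Positive starts shift down by one; negative starts already count from the end.
    slice_start = start - 1 if start > 0 else start
    if end == -1:
        # "Last character" cannot be written as an exclusive negative stop.
        slice_stop = END_OF_STRING
    elif end < 0:
        slice_stop = end + 1
    else:
        # An inclusive 1-based end equals an exclusive 0-based stop.
        slice_stop = end
    return slice_start, slice_stop


def str_sub(text: str, start: int, end: int) -> str:
    lo, hi = str_sub_bounds(start, end)
    return text[lo:hi]


print(str_sub("Apache Arrow", -5, -1), str_sub("Apache Arrow", -5, -2))
```

Only the `end == -1` branch needs the sentinel; every other negative end can still use the "add 1" rule from the issue description.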
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374908#comment-17374908 ]

David Li commented on ARROW-13259:
---

Maybe we could add a SliceOptions::kEnd constant just to make it clear what to do? (Not sure that'd help R?)
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374907#comment-17374907 ]

Joris Van den Bossche commented on ARROW-13259:
---

To copy over the practical example:

{code}
In [24]: import sys

In [25]: string = "Apache Arrow"

In [26]: pc.utf8_slice_codeunits(string, start=-5, stop=sys.maxsize)
Out[26]: <pyarrow.StringScalar: 'Arrow'>

In [27]: pc.utf8_slice_codeunits(string, start=-5, stop=-1)
Out[27]: <pyarrow.StringScalar: 'Arro'>
{code}

So "a large integer" can be used to indicate "slice until the end" (I suppose because you can never have a scalar string with a longer length than that value?). In Python this is {{sys.maxsize}}, in C++ it's {{std::numeric_limits<int64_t>::max()}}.
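The "large integer" idea also covers the "different lengths" part of the issue title: one stop value works for every string in a column. The sketch below shows this with plain Python slicing, which is assumed here to clamp an oversized stop the same way the kernel does; the sample strings are made up for illustration.

```python
import sys

words = ["Arrow", "Apache Arrow", "R"]

# The same oversized stop works for every length, so no per-string
# length computation is needed.
tails = [w[-3:sys.maxsize] for w in words]

# Using each string's own length as the stop gives the same result.
tails_by_len = [w[-3:len(w)] for w in words]

print(tails)
```

Note that the last entry is shorter than the requested three characters; an out-of-range start is clamped as well, so the whole string is returned.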
[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths
[ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374890#comment-17374890 ]

Maarten Breddels commented on ARROW-13259:
---

Does my comment [https://github.com/apache/arrow/pull/9000#issue-544990164] help you out?