[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-08 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377248#comment-17377248
 ] 

Nic Crane commented on ARROW-13259:
---

[~edponce] Thanks for that clarification, I'd totally missed that!

[~pachamaltese] - I missed this in my initial review of the code, but what 
actually needs changing is the bindings in `compute.cpp`. There, start and 
stop have been set to 1 and -1 respectively, but they need to reflect the 
default values defined here: 
[https://github.com/apache/arrow/blob/7eea2f53a1002552bbb87db5611e75c15b88b504/cpp/src/arrow/compute/api_scalar.h#L203-L210]

I think that the `step` argument also needs implementing.

We really should write this up (I can add it to my to-do list!) as it's neither 
obvious nor trivial to work out the various steps required here.

 

> [C++] Enable slicing to end of string using "utf8_slice_codeunits" when 
> string length unknown or different lengths 
> ---
>
> Key: ARROW-13259
> URL: https://issues.apache.org/jira/browse/ARROW-13259
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nic Crane
> Priority: Major
>
> We're currently trying to write bindings from the C++ function 
> "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour 
> of R's stringr::str_sub.
> In both the R and C++ implementations, I can use negative indices to count 
> back from the end of a string (shown below in R, but the latter directly 
> invokes the C++ implementation):
>  
> {code:r}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -2)
> [1] "Arro"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"),
> +   options = list(start=-5L, stop=-1L))
> Scalar
> Arro
> {code}
> Note that in the C++ implementation, I have to add 1 to the stop value as the 
> final value is non-inclusive.
> The problem is when I'm trying to use negative indices to refer to the final 
> values in a string:
>  
> {code:r}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -1)
> [1] "Arrow"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"),
> +   options = list(start=-5L, stop=0L))
> Scalar
> {code}
> The result is blank as the 'stop' value 0 refers to the start of the string, 
> effectively walking backwards, which isn't possible (except via the step 
> argument, which I can't get working but I don't think is what I want anyway).
> I've tried to get around this by attempting to write some code that 
> calculates the length of the string and supplies that to the stop argument, but 
> it didn't work.
> I do have a possible workaround that involves reversing the string, 
> extracting the substring using inverted values of swapped stop/start values, 
> and then reversing the result, but before I go down that path, I was 
> wondering if there is anything that can (and should! the answer may be a 
> simple "nope!") be changed in the C++ code to make it possible to do this a 
> different way?
>  
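
For illustration only, the reversal workaround mentioned in the issue description can be sketched in plain Python, whose string slicing uses the same half-open, negative-index convention described above. The helper name `slice_via_reverse` is hypothetical, this is not Arrow or R binding code, and it assumes both bounds are negative with a stringr-style inclusive end.

```python
def slice_via_reverse(text, start, end):
    """Extract text[start..end] (negative, inclusive bounds) via string reversal."""
    if not (start <= end < 0):
        raise ValueError("expected start <= end < 0")
    reversed_text = text[::-1]
    # Position -k counted from the end is 0-based index k - 1 once reversed.
    lo = -end - 1   # inclusive lower bound in the reversed string
    hi = -start     # exclusive upper bound in the reversed string
    return reversed_text[lo:hi][::-1]


print(slice_via_reverse("Apache Arrow", -5, -2))  # Arro
print(slice_via_reverse("Apache Arrow", -5, -1))  # Arrow
```

The cost is two extra reversals per string, which is why the thread looks for a way to express "slice to the end" directly.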



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-08 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377220#comment-17377220
 ] 

Eduardo Ponce commented on ARROW-13259:
---

Created [ARROW-13288|https://issues.apache.org/jira/browse/ARROW-13288] to 
verify inconsistencies between the C++ and Python kernel options.



[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-08 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377194#comment-17377194
 ] 

Eduardo Ponce commented on ARROW-13259:
---

[In C++ by default {{SliceOptions}} has the {{stop}} option set to 
{{std::numeric_limits<int64_t>::max()}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/api_scalar.h#L205-L206].
 Therefore, if you want to slice to the end of the string, simply omit a value for 
{{stop}} or set it to a value >= len(string).
{code:c++}
// start=-5, stop=std::numeric_limits<int64_t>::max(), step=1
SliceOptions opts(-5);
auto result = CallFunction("utf8_slice_codeunits", {Datum("Apache Arrow")},
                           &opts);
if (result.ok()) {
  Datum slice = std::move(result).ValueOrDie();
  // Prints "Arrow"
  std::cout << slice.scalar()->ToString() << std::endl;
} else {
  ARROW_LOG(ERROR) << result.status();
}
{code}
 

In R you should be able to do the following:
{code:r}
# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"),
+   options = list(start=-5L))
[1] "Arrow"
{code}
 

[~jorisvandenbossche]
 The issue in PyArrow arises because the [interface for {{SliceOptions}} does 
not set a default value for the {{stop}} option (only for the {{step}} 
option)|https://github.com/apache/arrow/blob/master/python/pyarrow/_compute.pyx#L798].
 Therefore, both {{start}} and {{stop}} are required arguments.
{code:python}
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5, stop=len(string))

{code}
 

[By providing {{sys.maxsize}} as default {{stop}} 
option|https://github.com/edponce/arrow/blob/ARROW-13259-Enable-slicing-to-end-of-string-using-ut/python/pyarrow/_compute.pyx#L800-L802],
 we can do the following:
{code:python}
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5)

{code}
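
As a rough sketch of the idea (not the actual PyArrow code), an options object whose {{stop}} defaults to {{sys.maxsize}} lets callers omit it. The names `SliceOptionsSketch` and `slice_codeunits_sketch` are hypothetical, and plain Python string slicing stands in for the C++ kernel here.

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceOptionsSketch:
    """Hypothetical stand-in for a binding-level options class."""
    start: int
    stop: int = sys.maxsize  # assumed default: "no upper bound"
    step: int = 1


def slice_codeunits_sketch(text, options):
    # Python slicing clamps an oversized stop to len(text).
    return text[options.start:options.stop:options.step]


print(slice_codeunits_sketch("Apache Arrow", SliceOptionsSketch(start=-5)))
print(slice_codeunits_sketch("Apache Arrow", SliceOptionsSketch(start=-5, stop=-1)))
```

On a 64-bit CPython build {{sys.maxsize}} is 2**63 - 1, the same value as the maximum of a signed 64-bit integer.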
 

The question that naturally follows from this JIRA is: *Are all the default 
options in PyArrow and R bindings consistent with C++ defaults?*



[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-05 Thread Mauricio 'Pachá' Vargas Sepúlveda (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374939#comment-17374939
 ] 

Mauricio 'Pachá' Vargas Sepúlveda commented on ARROW-13259:
---

Thanks a lot, I've edited my PR.
Since I'm on 21.04, I'm considering doing my work on a virtual machine until 
the build is fixed.



[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-05 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374924#comment-17374924
 ] 

Nic Crane commented on ARROW-13259:
---

Thanks very much [~maartenbreddels] and [~jorisvandenbossche] ! 

[~lidavidm] - nah, it's fine, I can just copy from the Python implementation 
and chuck in some R code like
{code:r}
if (stop == -1) stop <- .Machine$integer.max
{code}
CC [~pachamaltese]
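
A minimal sketch of that translation, written in Python for illustration rather than as the actual R binding: it maps stringr-style bounds (1-based, inclusive end) onto a half-open start/stop pair and swaps an end of -1 for a large sentinel. The names `SLICE_TO_END` and `str_sub_to_slice` are hypothetical.

```python
import sys

SLICE_TO_END = sys.maxsize  # hypothetical sentinel meaning "up to the end"


def str_sub_to_slice(start, end):
    """Map stringr-style (start, end) to a half-open (start, stop) pair."""
    # Positive starts are 1-based, so shift them to 0-based.
    slice_start = start - 1 if start > 0 else start
    if end == -1:
        # "Last character, inclusive" cannot be written as a negative stop.
        slice_stop = SLICE_TO_END
    elif end < 0:
        # Inclusive negative end becomes an exclusive stop one step later.
        slice_stop = end + 1
    else:
        # A 1-based inclusive end equals a 0-based exclusive stop.
        slice_stop = end
    return slice_start, slice_stop


text = "Apache Arrow"
for bounds in [(-5, -2), (-5, -1), (1, 6)]:
    start, stop = str_sub_to_slice(*bounds)
    print(bounds, "->", repr(text[start:stop]))
```

Without the special case, an end of -1 would turn into a stop of 0, which is exactly the empty-result problem described in the issue.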



[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-05 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374908#comment-17374908
 ] 

David Li commented on ARROW-13259:
--

Maybe we could add a SliceOptions::kEnd constant just to make it clear what to 
do? (Not sure that'd help R?)



[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-05 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374907#comment-17374907
 ] 

Joris Van den Bossche commented on ARROW-13259:
---

To copy over the practical example:

{code}
In [24]: import sys

In [25]: string = "Apache Arrow"

In [26]: pc.utf8_slice_codeunits(string, start=-5, stop=sys.maxsize)
Out[26]: 

In [27]: pc.utf8_slice_codeunits(string, start=-5, stop=-1)
Out[27]: 
{code}

So "a large integer" can be used to indicate "slice until the end" (I suppose 
because you can never have a scalar string with a longer length than that 
value?). 
In Python this is {{sys.maxsize}}, in C++ it's 
{{std::numeric_limits<int64_t>::max()}}.
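
To make the reasoning concrete, here is a small Python sketch of forward slicing with index clamping. It is an illustration of the semantics discussed above, not the kernel's implementation; the helper names are hypothetical and it works on Python code points with a step of 1 only.

```python
import sys


def normalize_index(index, length):
    """Resolve a possibly negative index and clamp it into [0, length]."""
    if index < 0:
        index += length
    return max(0, min(index, length))


def slice_forward(text, start, stop):
    """Half-open forward slice; returns '' when stop does not lie after start."""
    lo = normalize_index(start, len(text))
    hi = normalize_index(stop, len(text))
    return text[lo:hi] if lo < hi else ""


string = "Apache Arrow"
print(repr(slice_forward(string, -5, sys.maxsize)))  # 'Arrow'
print(repr(slice_forward(string, -5, -1)))           # 'Arro'
print(repr(slice_forward(string, -5, 0)))            # ''
```

Any stop at or beyond the string length clamps to the end, which is why a very large integer behaves as "slice until the end", while a stop of 0 resolves to a position before the start and yields an empty string.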



[jira] [Commented] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

2021-07-05 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374890#comment-17374890
 ] 

Maarten Breddels commented on ARROW-13259:
--

Does my comment [https://github.com/apache/arrow/pull/9000#issue-544990164] 
help you out?
