[
https://issues.apache.org/jira/browse/SPARK-48284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051938#comment-18051938
]
Uroš Bojanić commented on SPARK-48284:
--------------------------------------
Hi [~LuciferYang], thank you for your comment. Let me try to explain my
reasoning below in more detail.
I don't think we should necessarily always anchor Spark's UTF8String
method behaviour to Java's String. That said, I don't have a strong opinion
on the last 2 examples and I'm open to suggestions. Let's figure out the
best behaviour together.
My original proposal was based on maintaining the following logical
property:
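{code:java}
int position = haystack.indexOf(needle, startPosition);
// whenever position >= 0 (substring takes a begin and an end index):
haystack.substring(position, position + needle.length()).equals(needle){code}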
In other words, you can guarantee that "needle" is really found in "haystack"
at the position returned by `indexOf`.
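To make the property concrete, here is a minimal sketch that checks it against plain java.lang.String. The class and method names are hypothetical, and it uses Java's String rather than UTF8String, so treat it as an illustration only:
{code:java}
class IndexOfProperty {
    // Hypothetical helper: true when the needle really sits at the index
    // that indexOf reports (or when indexOf reports no match at all).
    static boolean needleFoundAtResult(String haystack, String needle, int startPosition) {
        int position = haystack.indexOf(needle, startPosition);
        if (position < 0) {
            return true; // no match reported, so there is nothing to verify
        }
        // Java's substring takes (beginIndex, endIndex), hence position + needle.length().
        return haystack.substring(position, position + needle.length()).equals(needle);
    }

    public static void main(String[] args) {
        System.out.println(needleFoundAtResult("abc", "b", 0));
        System.out.println(needleFoundAtResult("bbb", "b", 2));
    }
}{code}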
1) For example:
{code:java}
"abc".indexOf("b", 0); // returns 1, because "b" is found in "abc" at
// position=1 (when searching from startPosition=0)
"abc".indexOf("b", 1); // returns 1, because "b" is found in "abc" at
// position=1 (when searching from startPosition=1)
"abc".indexOf("b", 2); // returns -1, because "b" is NOT found in "abc" when
// searching from startPosition=2
"abc".indexOf("b", 5); // returns -1, because "b" is NOT found in "abc" when
// searching from startPosition=5{code}
Note that this is exactly how Java indexOf behaves currently.
2) Another example:
{code:java}
"bbb".indexOf("b", 0); // returns 0, because "b" is found in "bbb" at
// position=0 (when searching from startPosition=0)
"bbb".indexOf("b", 1); // returns 1, because "b" is found in "bbb" at
// position=1 (when searching from startPosition=1)
"bbb".indexOf("b", 2); // returns 2, because "b" is found in "bbb" at
// position=2 (when searching from startPosition=2)
"bbb".indexOf("b", 5); // returns -1, because "b" is NOT found in "bbb" when
// searching from startPosition=5{code}
Note that this is exactly how Java indexOf behaves currently.
3) Extending this logic to empty "needle", I would expect the following:
{code:java}
"abc".indexOf("", 0); // returns 0, because "" is found in "abc" at position=0
// (when searching from startPosition=0)
"abc".indexOf("", 1); // returns 1, because "" is found in "abc" at position=1
// (when searching from startPosition=1)
"abc".indexOf("", 2); // returns 2, because "" is found in "abc" at position=2
// (when searching from startPosition=2)
"abc".indexOf("", 5); // returns -1, because "" is NOT found in "abc" when
// searching from startPosition=5{code}
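Java's indexOf agrees on the first three calls but, for some reason, NOT on
the last one: "abc".indexOf("", 5) currently returns 3 (the length of the
string) in Java instead of -1.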
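For comparison, this small sketch shows what java.lang.String itself returns for an empty needle. The class and method names are hypothetical; the only library call is String.indexOf(String, int):
{code:java}
class JavaEmptyNeedle {
    // Hypothetical wrapper around the JDK call, so the results are easy to inspect.
    static int javaIndexOfEmpty(String haystack, int startPosition) {
        return haystack.indexOf("", startPosition);
    }

    public static void main(String[] args) {
        System.out.println(javaIndexOfEmpty("abc", 0)); // prints 0
        System.out.println(javaIndexOfEmpty("abc", 1)); // prints 1
        System.out.println(javaIndexOfEmpty("abc", 2)); // prints 2
        System.out.println(javaIndexOfEmpty("abc", 5)); // prints 3, not -1
    }
}{code}
So Java treats a start position past the end as if it were the length of the string, which is where it departs from the behaviour proposed here.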
The proposed behaviour would guarantee that whenever indexOf returns a
non-negative position, you will always be able to do something like:
{code:java}
int position = haystack.indexOf(needle, startPosition);
String match = haystack.substring(position, position + needle.length());
assert match.equals(needle);{code}
I find this a natural property to rely on, as a caller of indexOf. Also, I'm
not quite sure why the Java String behaves the way it does.
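A minimal sketch of the proposed semantics, written against java.lang.String rather than the actual UTF8String code. The class and method names are hypothetical, and treating startPosition equal to the haystack length as "in range" for an empty needle is an assumption of this sketch, since that case is not covered by the examples:
{code:java}
class ProposedIndexOf {
    // Hypothetical helper, not the UTF8String implementation.
    static int indexOf(String haystack, String needle, int startPosition) {
        if (needle.isEmpty()) {
            // Assumption: startPosition == haystack.length() still counts as in range.
            boolean inRange = startPosition >= 0 && startPosition <= haystack.length();
            return inRange ? startPosition : -1;
        }
        // Non-empty needles keep the existing Java behaviour.
        return haystack.indexOf(needle, startPosition);
    }

    public static void main(String[] args) {
        System.out.println(indexOf("abc", "", 2));  // prints 2
        System.out.println(indexOf("abc", "", 9));  // prints -1
        System.out.println(indexOf("abc", "", -3)); // prints -1
    }
}{code}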
What do you think?
> Fix UTF8String indexOf behaviour for empty string search
> --------------------------------------------------------
>
> Key: SPARK-48284
> URL: https://issues.apache.org/jira/browse/SPARK-48284
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> Currently, UTF8String.indexOf returns 0 when given an empty string
> parameter and any integer start value.
> Examples:
> {{"abc".indexOf("", 0); // returns: 0}}
> {{"abc".indexOf("", 2); // returns: 0}}
> {{"abc".indexOf("", 9); // returns: 0}}
> {{"abc".indexOf("", -3); // returns: 0}}
> This is not correct, as "start" is not taken into consideration.
> Correct behaviour would be:
> {{"abc".indexOf("", 0); // returns: 0}}
> {{"abc".indexOf("", 2); // returns: 2}}
> {{"abc".indexOf("", 9); // returns: -1}}
> {{"abc".indexOf("", -3); // returns: -1}}