Re: RFR 8230365 : Pattern for a control-char matches non-control characters

Ivan Gerasimov Thu, 05 Sep 2019 13:44:23 -0700

Hi Stuart!

Thanks for the comments, and please see my answers inline:


On 9/5/19 1:18 PM, Stuart Marks wrote:

Conceptually I think having the restriction is fine. If we weredesigning this as a new feature, I'd recommend putting in therestrictions from the very beginning.
However, since the old behavior has been out there for 20 years, mymain concern is compatibility. Having system properties to controlthis is ... ok ... I guess. I wish there were a better way, but itseems the simplest. However, I'd strongly advise making the behaviorchange and the compatibility mode(s) as simple as possible. Havingmore configuration options complicates the code and complicates thetesting, and I really don't see the need for it.
So yes, I agree with abandoning ALLOW_LOWERCASE_CONTROL_CHAR_IDS.

Okay.  I've just sent out the updated webrev without this option:
http://cr.openjdk.java.net/~igerasim/8230365/03/webrev/

And filed the CSR for the spec change and the new property:
https://bugs.openjdk.java.net/browse/JDK-8230675

I guess now we get to have a bikeshed on the property name. Does thisapply only to mapping of control characters, or is there some otherform of pattern syntax that could be made more strict? Ivan, I seem torecall you talking to me about making some changes in the syntax orinterpretation of the construct that enables/disables various flags.This one:
|(?idmsuxU-idmsuxU)|
Would it make sense to have a single property, e.g.jdk.util.regex.strictSyntax, that governs this one as well? And arethere other possibilities for tuning up the parsing behavior that weshould be taking?

Yes, that was this bug:

https://bugs.openjdk.java.net/browse/JDK-8225021 (Treat ambiguousembedded flags as parse syntax errors)


And actually I have yet another bug of similar flavor:

https://bugs.openjdk.java.net/browse/JDK-8214245 (Case insensitivematching doesn't work correctly for some character classes)

However, this last one is about changing the matching rules, and notabout the syntax (And it is still waiting for review!).

I'd rather not have a situation where we fix up one area of syntax andadd a property for it, fix up another area and add a differentproperty, leading to property proliferation.

Personally, I don't have a strong preference here.

The compatibility property are meant to be temporary anyways.

Maybe if we have a single option that will control several differentaspects of behavior, it will be harder to get rid of it?

Partially, because it will be tempting to reuse it for other similarchanges, should they be needed.


With kind regards,

Ivan

s'marks


On 9/4/19 9:00 PM, Martin Buchholz wrote:

Thanks, Ivan.  We're mostly in agreement.

+     * If {@code true} then lower-case control-character ids are mapped to the
+     * their upper-case counterparts.
Extra "the".

After all these decades I only now realize that c ^= 0x40 moves '?'to the end of the ASCII range and all the other controls to the start!

Should we support lower-case controls? Compatibility with perl regexstill matters, but a lot less than in 2003. But the key is that wegot the WRONG ANSWER previously, so when we restrict the control idslet's just make lower case controls syntax errors. Silently changingbehavior is bad for users. ... so let's abandonALLOW_LOWERCASE_CONTROL_CHAR_IDS.

An alternative:
int ch = read() ^ 0x40;
if (!RESTRICTED_CONTROL_CHAR_IDS || ch < 0x20 || ch == 0x7f) return ch;

On Wed, Sep 4, 2019 at 6:49 PM Ivan Gerasimov<[email protected] <mailto:[email protected]>> wrote:


    Thank you Martin!

    On 8/30/19 6:19 PM, Martin Buchholz wrote:
    > There's a strong expectation that ctrl-A and ctrl-a both map to
    > \u0001, so I support Ivan's initiative.
    > I'm surprised java regex gets this wrong.
    > Might need a transitional system property.
    >
    Right.  I think it would be best to introduce two system properties:

    The first, to turn on/off the restrictions on the control-char
    names.
    This will be enabled by default, and will permit names from the
    limited
    list: capital letters and a few other special characters.

    The second one, to enable mapping of lower-case control-char
    names to
    their upper-case counterpart.  This option should be turned off by
    default for the current release of JDK, and then turned on by
    default
    for some subsequent release (when, presumably, most applications
    that
    use this kind of regexp are fixed).

    This all feels like a little bit too much for such a rarely used
    feature, but probably is a proper thing to do anyway :-)

    If we have an agreement on these system properties, I can create a
    separate test to verify all possible combinations.


    > What's the best bit-twiddle?  Untested:
    > if ((c ^= 0x40) < 0x20) return c;
    > if ((c ^=0x20)  <= 26 && c > 0) return c;
    >
    > 0x40 is more readable than 64.
    >
    `((ch-0x3f)|(0x5f-ch)) >= 0` does the trick for regular
    (non-lower-case)
    ids.

    > Does ctrol-? get mapped to 0x7f ?
    >
    Yes. I've got it in the test at the end of the line 4997.

    Would you please help review the updated webrev:

    http://cr.openjdk.java.net/~igerasim/8230365/02/webrev/

    With kind regards,

    Ivan

--
With kind regards,
Ivan Gerasimov

Re: RFR 8230365 : Pattern for a control-char matches non-control characters

Reply via email to