[jira] [Comment Edited] (VALIDATOR-501) DomainValidator accepts hyphens at start/end of domain name with Unicode characters

Kushal Dixit (Jira) Thu, 11 Sep 2025 07:24:07 -0700


    [ 
https://issues.apache.org/jira/browse/VALIDATOR-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018592#comment-18018592
 ]


Kushal Dixit edited comment on VALIDATOR-501 at 9/11/25 2:21 PM:
-----------------------------------------------------------------

Hi [~ggregory] ,

I see you have added test cases in revision [319400a 
.|https://github.com/apache/commons-validator/commit/319400a5ab8a694dd8ebc6131ae3fd20c9db0edd].
 I'm not sure if a fix is already available, but I'd like to share a potential 
solution I've been working on.

*RCA:*
Validation for leading and trailing hyphens (and related patterns) is already 
implemented. The issue arises with domains containing special characters, such 
as {*}-tést.fr-{*}. These get converted to punycode (e.g., 
{*}{{xn-tst-cpa.fr}}{*}). In the punycode form, there are no visible leading or 
trailing hyphens, so the domain appears valid even though the original form had 
an invalid leading hyphen.

*Proposed Solution:*

While the issue could potentially be solved by restructuring the existing regex 
validation and {{unicodeToASCII}} conversion, I believe adding an efficient 
pre-validation utility method would be a cleaner approach. This method would 
check for RFC-violating hyphens *before* the IDN conversion takes place, 
preventing the conversion from masking the original violations.
{code:java}
public boolean isValid(final String domain) {
    if (domain == null) {
        return false;
    }

    // Early checking for hyphens (-)
    if (hasLeadingOrTrailingHyphensInLabel(domain)) {
        return false;
    }

    // ... existing code
}

private boolean hasLeadingOrTrailingHyphensInLabel(final String domain) {
    // early exit
    final int n = domain.length();
    if (n == 0) {
        return false;
    }

    // remove leading white spaces.
    int start = 0;
    while (start < n && Character.isWhitespace(domain.charAt(start))) {
        start++;
    }

    if (start == n) {
        return false;
    }

    // remove trailing white spaces.
    int end = n - 1;
    while (end >= start && Character.isWhitespace(domain.charAt(end))) {
        end--;
    }

    // check first and last char is (-) just for early exit
    if (domain.charAt(start) == '-' || domain.charAt(end) == '-') {
        return true;
    }

    for (int i = start + 1; i <= end; i++) {
        if (domain.charAt(i) == '.') {
            // Find last non-whitespace before (.)
            int prevPos = i - 1;
            while (prevPos >= start && 
Character.isWhitespace(domain.charAt(prevPos))) {
                prevPos--;
            }
            if (prevPos >= start && domain.charAt(prevPos) == '-') {
                return true;
            }

            // Find first non-whitespace after (.)
            int nextPos = i + 1;
            while (nextPos < end && 
Character.isWhitespace(domain.charAt(nextPos))) {
                nextPos++;
            }
            if (nextPos < end && domain.charAt(nextPos) == '-') {
                return true;
            }
        }
    }

    // finally return false if (-) is found no-where.
    return false;
}{code}
If code readability is prioritised over performance, simpler implementations 
using {{{}regex{}}}, {{trim()}} and {{split()}} would also work, but the core 
principle remains the same: {*}validate hyphen placement before IDN 
conversion{*}.
 
We can add an early length check in the utility method to quickly reject 
obviously invalid domains. This way, the complexity stays bounded by 
*O(MAX_DOMAIN_LENGTH)* rather than scaling with input size.
{code:java}
if (domain.length() > MAX_DOMAIN_LENGTH) {
    return true; // invalid domain
} {code}
Let me know your thoughts on this approach!

{*}Note{*}: I have tested with different scenarios and it seems to be working 
for most of the cases, I found white spaces were creating some problem for ex. 
{{www.  -google.com}} hence I handled it through different while loops.


was (Author: JIRAUSER310903):
Hi [~ggregory] ,

I see you have added test cases in revision [319400a 
.|https://github.com/apache/commons-validator/commit/319400a5ab8a694dd8ebc6131ae3fd20c9db0edd].
 I'm not sure if a fix is already available, but I'd like to share a potential 
solution I've been working on.

*RCA:*
Validation for leading and trailing hyphens (and related patterns) is already 
implemented. The issue arises with domains containing special characters, such 
as {*}-tést.fr{*}. These get converted to punycode (e.g., 
{*}{{xn--tst-cpa.fr}}{*}). In the punycode form, there are no visible leading 
or trailing hyphens, so the domain appears valid even though the original form 
had an invalid leading hyphen.

While the issue could potentially be solved by restructuring the existing regex 
validation and {{unicodeToASCII}} conversion, I believe adding an efficient 
pre-validation utility method would be a cleaner approach. This method would 
check for RFC-violating hyphens *before* the IDN conversion takes place, 
preventing the conversion from masking the original violations.
{code:java}
public boolean isValid(final String domain) {
    if (domain == null) {
        return false;
    }

    // Early checking for hyphens (-)
    if (hasLeadingOrTrailingHyphensInLabel(domain)) {
        return false;
    }

    // ... existing code
}

private boolean hasLeadingOrTrailingHyphensInLabel(final String domain) {
    // early exit
    final int n = domain.length();
    if (n == 0) {
        return false;
    }

    // remove leading white spaces.
    int start = 0;
    while (start < n && Character.isWhitespace(domain.charAt(start))) {
        start++;
    }

    if (start == n) {
        return false;
    }

    // remove trailing white spaces.
    int end = n - 1;
    while (end >= start && Character.isWhitespace(domain.charAt(end))) {
        end--;
    }

    // check first and last char is (-) just for early exit
    if (domain.charAt(start) == '-' || domain.charAt(end) == '-') {
        return true;
    }

    for (int i = start + 1; i <= end; i++) {
        if (domain.charAt(i) == '.') {
            // Find last non-whitespace before (.)
            int prevPos = i - 1;
            while (prevPos >= start && 
Character.isWhitespace(domain.charAt(prevPos))) {
                prevPos--;
            }
            if (prevPos >= start && domain.charAt(prevPos) == '-') {
                return true;
            }

            // Find first non-whitespace after (.)
            int nextPos = i + 1;
            while (nextPos < end && 
Character.isWhitespace(domain.charAt(nextPos))) {
                nextPos++;
            }
            if (nextPos < end && domain.charAt(nextPos) == '-') {
                return true;
            }
        }
    }

    // finally return false if (-) is found no-where.
    return false;
}{code}
This is most efficient version of {{hasLeadingOrTrailingHyphensInLabel.}}

If code readability is prioritised over performance, simpler implementations 
using {{{}regex{}}}, {{trim()}} and {{split()}} would also work, but the core 
principle remains the same: {*}validate hyphen placement before IDN 
conversion{*}.
 
We can add an early length check in the utility method to quickly reject 
obviously invalid domains. This way, the complexity stays bounded by 
*O(MAX_DOMAIN_LENGTH)* rather than scaling with input size.
{code:java}
if (domain.length() > MAX_DOMAIN_LENGTH) {
    return true; // invalid domain
} {code}
Let me know your thoughts on this approach!

{*}Note{*}: I have tested with different scenarios and it seems to be working 
for most of the cases, I found white spaces were creating some problem for ex. 
{{www.  -google.com}} hence I handled it through different while loops.

> DomainValidator accepts hyphens at start/end of domain name with Unicode 
> characters
> -----------------------------------------------------------------------------------
>
>                 Key: VALIDATOR-501
>                 URL: https://issues.apache.org/jira/browse/VALIDATOR-501
>             Project: Commons Validator
>          Issue Type: Bug
>            Reporter: Victoria Dimitrova
>            Priority: Major
>              Labels: BUG, validator
>
> The method `DomainValidator.getInstance().isValid()` returns `true` for 
> invalid domain names that start or end with a hyphen when the domain name 
> contains Unicode characters.
> This behavior is incorrect according to the domain name specifications (RFC 
> 1035, RFC 5890), which do not allow:
>  - Domain labels starting or ending with a hyphen (`-`)
> Test case that should pass but fails on version 1.10.0:
> {code:java}
> @Test
> void shouldBeInvalid()
> {
>  assertAll(() -> 
> assertFalse(DomainValidator.getInstance().isValid("-test.fr")),  
> () -> assertFalse(DomainValidator.getInstance().isValid("-tést.fr")),
> () -> assertFalse(DomainValidator.getInstance().isValid("test-.fr")),    
> () -> assertFalse(DomainValidator.getInstance().isValid("tést-.fr")));
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (VALIDATOR-501) DomainValidator accepts hyphens at start/end of domain name with Unicode characters

Reply via email to