[
https://issues.apache.org/jira/browse/VALIDATOR-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018592#comment-18018592
]
Kushal Dixit edited comment on VALIDATOR-501 at 9/11/25 2:21 PM:
-----------------------------------------------------------------
Hi [~ggregory] ,
I see you have added test cases in revision [319400a
.|https://github.com/apache/commons-validator/commit/319400a5ab8a694dd8ebc6131ae3fd20c9db0edd].
I'm not sure if a fix is already available, but I'd like to share a potential
solution I've been working on.
*RCA:*
Validation for leading and trailing hyphens (and related patterns) is already
implemented. The issue arises with domains containing special characters, such
as {*}-tést.fr-{*}. These get converted to punycode (e.g.,
{*}{{xn-tst-cpa.fr}}{*}). In the punycode form, there are no visible leading or
trailing hyphens, so the domain appears valid even though the original form had
an invalid leading hyphen.
*Proposed Solution:*
While the issue could potentially be solved by restructuring the existing regex
validation and {{unicodeToASCII}} conversion, I believe adding an efficient
pre-validation utility method would be a cleaner approach. This method would
check for RFC-violating hyphens *before* the IDN conversion takes place,
preventing the conversion from masking the original violations.
{code:java}
public boolean isValid(final String domain) {
if (domain == null) {
return false;
}
// Early checking for hyphens (-)
if (hasLeadingOrTrailingHyphensInLabel(domain)) {
return false;
}
// ... existing code
}
private boolean hasLeadingOrTrailingHyphensInLabel(final String domain) {
// early exit
final int n = domain.length();
if (n == 0) {
return false;
}
// remove leading white spaces.
int start = 0;
while (start < n && Character.isWhitespace(domain.charAt(start))) {
start++;
}
if (start == n) {
return false;
}
// remove trailing white spaces.
int end = n - 1;
while (end >= start && Character.isWhitespace(domain.charAt(end))) {
end--;
}
// check first and last char is (-) just for early exit
if (domain.charAt(start) == '-' || domain.charAt(end) == '-') {
return true;
}
for (int i = start + 1; i <= end; i++) {
if (domain.charAt(i) == '.') {
// Find last non-whitespace before (.)
int prevPos = i - 1;
while (prevPos >= start &&
Character.isWhitespace(domain.charAt(prevPos))) {
prevPos--;
}
if (prevPos >= start && domain.charAt(prevPos) == '-') {
return true;
}
// Find first non-whitespace after (.)
int nextPos = i + 1;
while (nextPos < end &&
Character.isWhitespace(domain.charAt(nextPos))) {
nextPos++;
}
if (nextPos < end && domain.charAt(nextPos) == '-') {
return true;
}
}
}
// finally return false if (-) is found no-where.
return false;
}{code}
If code readability is prioritised over performance, simpler implementations
using {{{}regex{}}}, {{trim()}} and {{split()}} would also work, but the core
principle remains the same: {*}validate hyphen placement before IDN
conversion{*}.
We can add an early length check in the utility method to quickly reject
obviously invalid domains. This way, the complexity stays bounded by
*O(MAX_DOMAIN_LENGTH)* rather than scaling with input size.
{code:java}
if (domain.length() > MAX_DOMAIN_LENGTH) {
return true; // invalid domain
} {code}
Let me know your thoughts on this approach!
{*}Note{*}: I have tested with different scenarios and it seems to be working
for most of the cases, I found white spaces were creating some problem for ex.
{{www. -google.com}} hence I handled it through different while loops.
was (Author: JIRAUSER310903):
Hi [~ggregory] ,
I see you have added test cases in revision [319400a
.|https://github.com/apache/commons-validator/commit/319400a5ab8a694dd8ebc6131ae3fd20c9db0edd].
I'm not sure if a fix is already available, but I'd like to share a potential
solution I've been working on.
*RCA:*
Validation for leading and trailing hyphens (and related patterns) is already
implemented. The issue arises with domains containing special characters, such
as {*}-tést.fr{*}. These get converted to punycode (e.g.,
{*}{{xn--tst-cpa.fr}}{*}). In the punycode form, there are no visible leading
or trailing hyphens, so the domain appears valid even though the original form
had an invalid leading hyphen.
While the issue could potentially be solved by restructuring the existing regex
validation and {{unicodeToASCII}} conversion, I believe adding an efficient
pre-validation utility method would be a cleaner approach. This method would
check for RFC-violating hyphens *before* the IDN conversion takes place,
preventing the conversion from masking the original violations.
{code:java}
public boolean isValid(final String domain) {
if (domain == null) {
return false;
}
// Early checking for hyphens (-)
if (hasLeadingOrTrailingHyphensInLabel(domain)) {
return false;
}
// ... existing code
}
private boolean hasLeadingOrTrailingHyphensInLabel(final String domain) {
// early exit
final int n = domain.length();
if (n == 0) {
return false;
}
// remove leading white spaces.
int start = 0;
while (start < n && Character.isWhitespace(domain.charAt(start))) {
start++;
}
if (start == n) {
return false;
}
// remove trailing white spaces.
int end = n - 1;
while (end >= start && Character.isWhitespace(domain.charAt(end))) {
end--;
}
// check first and last char is (-) just for early exit
if (domain.charAt(start) == '-' || domain.charAt(end) == '-') {
return true;
}
for (int i = start + 1; i <= end; i++) {
if (domain.charAt(i) == '.') {
// Find last non-whitespace before (.)
int prevPos = i - 1;
while (prevPos >= start &&
Character.isWhitespace(domain.charAt(prevPos))) {
prevPos--;
}
if (prevPos >= start && domain.charAt(prevPos) == '-') {
return true;
}
// Find first non-whitespace after (.)
int nextPos = i + 1;
while (nextPos < end &&
Character.isWhitespace(domain.charAt(nextPos))) {
nextPos++;
}
if (nextPos < end && domain.charAt(nextPos) == '-') {
return true;
}
}
}
// finally return false if (-) is found no-where.
return false;
}{code}
This is most efficient version of {{hasLeadingOrTrailingHyphensInLabel.}}
If code readability is prioritised over performance, simpler implementations
using {{{}regex{}}}, {{trim()}} and {{split()}} would also work, but the core
principle remains the same: {*}validate hyphen placement before IDN
conversion{*}.
We can add an early length check in the utility method to quickly reject
obviously invalid domains. This way, the complexity stays bounded by
*O(MAX_DOMAIN_LENGTH)* rather than scaling with input size.
{code:java}
if (domain.length() > MAX_DOMAIN_LENGTH) {
return true; // invalid domain
} {code}
Let me know your thoughts on this approach!
{*}Note{*}: I have tested with different scenarios and it seems to be working
for most of the cases, I found white spaces were creating some problem for ex.
{{www. -google.com}} hence I handled it through different while loops.
> DomainValidator accepts hyphens at start/end of domain name with Unicode
> characters
> -----------------------------------------------------------------------------------
>
> Key: VALIDATOR-501
> URL: https://issues.apache.org/jira/browse/VALIDATOR-501
> Project: Commons Validator
> Issue Type: Bug
> Reporter: Victoria Dimitrova
> Priority: Major
> Labels: BUG, validator
>
> The method `DomainValidator.getInstance().isValid()` returns `true` for
> invalid domain names that start or end with a hyphen when the domain name
> contains Unicode characters.
> This behavior is incorrect according to the domain name specifications (RFC
> 1035, RFC 5890), which do not allow:
> - Domain labels starting or ending with a hyphen (`-`)
> Test case that should pass but fails on version 1.10.0:
> {code:java}
> @Test
> void shouldBeInvalid()
> {
> assertAll(() ->
> assertFalse(DomainValidator.getInstance().isValid("-test.fr")),
> () -> assertFalse(DomainValidator.getInstance().isValid("-tést.fr")),
> () -> assertFalse(DomainValidator.getInstance().isValid("test-.fr")),
> () -> assertFalse(DomainValidator.getInstance().isValid("tést-.fr")));
> }
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)