Seungmin123 opened a new issue, #15754:
URL: https://github.com/apache/lucene/issues/15754
### Description
`HTMLStripCharFilter` fails to recognize the closing double quote of an
attribute value if the value ends with an equals sign (`=`) and is immediately
followed by the tag closer (`>`). This causes the filter to consume all
subsequent text and tags as part of the attribute value until the next double
quote is encountered in the document.
### Reproduction
The following test case reproduces the issue:
```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
// JUnit's @Test and assertEquals are assumed to be imported by the enclosing test class.

@Test
public void testEqualsAtEndOfAttributeBug() throws IOException {
  // Input: the first tag's attribute value ends with '=', followed directly by the
  // closing double quote and '>'
  String input =
      "<a href=\"https://www.example.com/?test=\">example</a> "
          + "this gets discarded <a href=\"next\">example2</a>";

  // Expected: "example this gets discarded example2" (only tags should be stripped)
  // Actual: "example2" (content is discarded until the next double quote
  // in the second tag is found)
  Reader reader = new HTMLStripCharFilter(new StringReader(input));
  StringBuilder sb = new StringBuilder();
  char[] buffer = new char[1024];
  int len;
  while ((len = reader.read(buffer)) != -1) {
    sb.append(buffer, 0, len);
  }

  // Collapse whitespace so the comparison depends only on the surviving text
  String actual = sb.toString().trim().replaceAll("\\s+", " ");
  // This fails because the output is just "example2"
  assertEquals("example this gets discarded example2", actual);
}
```
### Analysis
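The issue seems to be specific to the sequence `=">` (written `=\">` in a Java
string literal, as in the examples below): an equals sign directly before the
closing double quote, which is directly followed by `>`.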
Interestingly:
- It works fine if there is a space after the equals: `href=\"test= \">`
- It works fine with single quotes: `href='test='>`
- It works fine if there are no quotes: `href=test=>`
- It works fine if there are multiple equals: `href=\"test==\">`
It appears to be a state machine issue in the parser where the combination
of double quotes and a trailing equals sign confuses the quote detection.
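
To make the variants above easier to compare, here is a minimal sketch that builds one input per variant and computes the text a correct stripper should leave behind. Everything in it is hypothetical and not part of Lucene: the class `AttributeVariants`, the helper `buildInput`, and the regex-based `naiveStrip`. The regex is only a stand-in reference that happens to work for these inputs, because none of them contains `>` inside an attribute value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for comparing the variants; not part of Lucene.
public class AttributeVariants {

  // Appends the surrounding text from the reproduction to one opening tag.
  public static String buildInput(String openingTag) {
    return openingTag + "example</a> this gets discarded <a href=\"next\">example2</a>";
  }

  // Naive reference stripper: removes anything between '<' and the next '>'.
  // Only valid here because no attribute value in these inputs contains '>'.
  public static String naiveStrip(String html) {
    return html.replaceAll("<[^>]*>", "").trim().replaceAll("\\s+", " ");
  }

  // One opening tag per variant described in the analysis.
  public static Map<String, String> variants() {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("double quotes, trailing '=' (reported failure)",
        "<a href=\"https://www.example.com/?test=\">");
    m.put("space after '='", "<a href=\"test= \">");
    m.put("single quotes", "<a href='test='>");
    m.put("no quotes", "<a href=test=>");
    m.put("two '=' signs", "<a href=\"test==\">");
    return m;
  }

  public static void main(String[] args) {
    for (Map.Entry<String, String> e : variants().entrySet()) {
      String input = buildInput(e.getValue());
      System.out.println(e.getKey() + " -> " + naiveStrip(input));
    }
  }
}
```

Feeding each `buildInput(...)` result through `new HTMLStripCharFilter(new StringReader(input))`, as in the reproduction, and comparing against `naiveStrip` should show a mismatch only for the first variant, if the behavior matches what is described above.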
### Environment
- Lucene Version: 10.3.2
- JDK: 21
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]