[ 
https://issues.apache.org/jira/browse/TIKA-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156637#comment-17156637
 ] 

chenshuming commented on TIKA-3127:
-----------------------------------

I ran a test against 2.0.0-SNAPSHOT, and the problem still exists.

 

I used this code to test:
{code:java}
@Test
public void testParseHref() throws IOException, TikaException, SAXException {
    // Test resource containing a single anchor whose href has no value.
    String path = "/test-documents/testHTMLHref.html";
    LinkContentHandler link = new LinkContentHandler();
    Metadata metadata = new Metadata();

    try (InputStream stream = HtmlParserTest.class.getResourceAsStream(path)) {
        new HtmlParser().parse(stream, link, metadata, new ParseContext());
    }

    // A valueless href should be reported as the empty string, not "href".
    assertEquals("", link.getLinks().get(0).getUri());
}
{code}
 

The test fails when the test file's content is:
{code:html}
<html>
   <a href>link</a>
</html>
{code}
The test passes when the test file's content is:
{code:html}
<html>
   <a href=>link</a>
</html>
{code}
 

The problem seems to occur in the method 
org.ccil.cowan.tagsoup.HTMLScanner.scan(Reader, ScanHandler).
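
For comparison, here is a minimal sketch of the behavior I would expect, based on the HTML specification linked in the issue: an attribute written without a value, and one written with a bare "=", should both come out as the empty string. This is not TagSoup or Tika code. The class EmptyAttributeSketch and the method firstTagAttributes are hypothetical names used only for illustration, and the scanner is deliberately simplified (it handles a single start tag, with no entities or comments).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of TagSoup or Tika.
class EmptyAttributeSketch {

    /** Returns the attributes of the first start tag found in html. */
    static Map<String, String> firstTagAttributes(String html) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = html.length();
        int i = html.indexOf('<');
        if (i < 0) {
            return attrs;
        }
        i++;
        // Skip the tag name.
        while (i < n && !Character.isWhitespace(html.charAt(i)) && html.charAt(i) != '>') {
            i++;
        }
        while (i < n && html.charAt(i) != '>') {
            if (Character.isWhitespace(html.charAt(i))) {
                i++;
                continue;
            }
            // Read the attribute name up to whitespace, '=' or '>'.
            int nameStart = i;
            while (i < n && !Character.isWhitespace(html.charAt(i))
                    && html.charAt(i) != '=' && html.charAt(i) != '>') {
                i++;
            }
            String name = html.substring(nameStart, i).toLowerCase(Locale.ROOT);
            while (i < n && Character.isWhitespace(html.charAt(i))) {
                i++;
            }
            // Default for a valueless attribute: the empty string, not the name.
            String value = "";
            if (i < n && html.charAt(i) == '=') {
                i++;
                while (i < n && Character.isWhitespace(html.charAt(i))) {
                    i++;
                }
                if (i < n && (html.charAt(i) == '"' || html.charAt(i) == '\'')) {
                    // Quoted value: read up to the matching quote.
                    char quote = html.charAt(i++);
                    int valueStart = i;
                    while (i < n && html.charAt(i) != quote) {
                        i++;
                    }
                    value = html.substring(valueStart, i);
                    if (i < n) {
                        i++; // consume the closing quote
                    }
                } else {
                    // Unquoted value: may be empty, as in <a href=>.
                    int valueStart = i;
                    while (i < n && !Character.isWhitespace(html.charAt(i))
                            && html.charAt(i) != '>') {
                        i++;
                    }
                    value = html.substring(valueStart, i);
                }
            }
            // Keep the first occurrence if an attribute name is repeated.
            attrs.putIfAbsent(name, value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(firstTagAttributes("<a href>link</a>"));
        System.out.println(firstTagAttributes("<a href=>link</a>"));
    }
}
```

With this sketch, both test files above yield an empty href, while an explicit href="href" still keeps its literal value. So the two cases can be told apart at scan time, which is not possible once the name has been copied into the value.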

> When using html parser any empty attribute sets value to attribute name e.g. 
> <a href>link</a> gives href="href"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3127
>                 URL: https://issues.apache.org/jira/browse/TIKA-3127
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16, 1.24
>            Reporter: Milan Vereščák
>            Priority: Major
>
> Shouldn't it be rather empty string? It was present in 1.16 version but also 
> in 1.24.1
> Thank you for your response.
> [html 
> specification|https://html.spec.whatwg.org/multipage/syntax.html#attributes-2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
