[ https://issues.apache.org/jira/browse/TIKA-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156637#comment-17156637 ]
chenshuming commented on TIKA-3127:
-----------------------------------

I ran the test against 2.0.0-SNAPSHOT and the problem still exists. The following test reproduces it:

{code:java}
@Test
public void testParseHref() throws IOException, TikaException, SAXException {
    String path = "/test-documents/testHTMLHref.html";
    LinkContentHandler link = new LinkContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = HtmlParserTest.class.getResourceAsStream(path)) {
        new HtmlParser().parse(stream, link, metadata, new ParseContext());
    }
    // An href attribute without a value should yield an empty URI.
    assertEquals("", link.getLinks().get(0).getUri());
}
{code}

The test fails when the test file's content is:

{code:html}
<html>
<a href>link</a>
</html>
{code}

The test passes when the test file's content is:

{code:html}
<html>
<a href=>link</a>
</html>
{code}

The problem seems to occur in the method org.ccil.cowan.tagsoup.HTMLScanner.scan(Reader, ScanHandler).

> When using html parser any empty attribute sets value to attribute name e.g. <a href>link</a> gives href="href"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3127
>                 URL: https://issues.apache.org/jira/browse/TIKA-3127
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16, 1.24
>            Reporter: Milan Vereščák
>            Priority: Major
>
> Shouldn't it be rather empty string? It was present in 1.16 version but also in 1.24.1
> Thank you for your response.
> [html specification|https://html.spec.whatwg.org/multipage/syntax.html#attributes-2]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)