[jira] [Commented] (TIKA-2010) Unable to get &lt;title&gt; value when header is incorrect</span></a></span> </h1> <p class="darkgray font13"> <span class="sender pipe"><a href="/search?l=dev@tika.apache.org&q=from:%22Ken+Krugler+%5C%28JIRA%5C%29%22" rel="nofollow"><span itemprop="author" itemscope itemtype="http://schema.org/Person"><span itemprop="name">Ken Krugler (JIRA)</span></span></a></span> <span class="date"><a href="/search?l=dev@tika.apache.org&q=date:20160615" rel="nofollow">Wed, 15 Jun 2016 07:27:21 -0700</a></span> </p> </div> <div itemprop="articleBody" class="msgBody"> <!--X-Body-of-Message--> <pre> [ <a rel="nofollow" href="https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331829#comment-15331829">https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331829#comment-15331829</a> ] </pre><pre>
Ken Krugler commented on TIKA-2010:
-----------------------------------

Would it be possible for you to try this broken HTML with JSoup? Asking because we're discussing switching to JSoup over in [TIKA-1599].

> Unable to get <title> value when header is incorrect
> ----------------------------------------------------
>
> Key: TIKA-2010
> URL: <a rel="nofollow" href="https://issues.apache.org/jira/browse/TIKA-2010">https://issues.apache.org/jira/browse/TIKA-2010</a>
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Florent Valdelievre
>
> A lot of websites don't have valid data within the <head></head> tag. However,
> even if the header data is invalid (misplaced tags, etc.), we should be able to
> get the title tag value if it is present.
> Please find below a straightforward unit test to reproduce the problem. You
> will notice I have added an anchor between the <head><a></a></head> tags,
> which is not correct. If you remove it, the title value is found.
> {code:java}
> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.nio.charset.Charset;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.html.dom.HTMLDocumentImpl;
> import org.apache.nutch.parse.html.DOMBuilder;
> import org.apache.nutch.parse.tika.DOMContentUtils;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.junit.Assert;
> import org.junit.Before;
> import org.junit.Test;
> import org.w3c.dom.DocumentFragment;
> public class TestTikaGetTitleWithInvalidHeaders {
>     private Configuration conf;
>     static byte[] readFile(String path, Charset encoding) throws IOException {
>         return Files.readAllBytes(Paths.get(path));
>     }
>     // Test page: the anchor inside the head element is the misplaced tag;
>     // the title is found when the anchor is removed.
>     private final static String WEBPAGE =
>         "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\" \"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">"
>         + "<html>"
>         + "<head>"
>         + "<a href=\"https://plus.google.com/113911985765464238166\" rel=\"publisher\">Google+</a> "
>         + "<title>Welcome!</title>"
>         + "</head>"
>         + "<body>"
>         + "content"
>         + "</body>"
>         + "</html>";
>
>     @Before
>     public void setUp() throws Exception {
>         conf = new Configuration();
>     }
>     @Test
>     public void testGetTitle() {
>         HTMLDocumentImpl doc = new HTMLDocumentImpl();
>         doc.setErrorChecking(false);
>         DocumentFragment root = doc.createDocumentFragment();
>         Parser parser = new org.apache.tika.parser.html.HtmlParser();
>         DOMBuilder domBuilder = new DOMBuilder(doc, root);
>         try {
>             // Parse the page with Tika's HtmlParser into the DOM fragment.
>             parser.parse(new ByteArrayInputStream(WEBPAGE.getBytes()), domBuilder, new Metadata(), new ParseContext());
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>         // Extract the title from the parsed DOM; "Welcome!" is expected.
>         StringBuffer sb = new StringBuffer();
>         new 
DOMContentUtils(conf).getTitle(sb, root);
>         Assert.assertEquals("Welcome!", sb.toString());
>     }
> }
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
</pre> </div> </div> </div> <div class="footer" role="contentinfo"> <ul> <li><a href="/">The Mail Archive home</a></li> <li class="darkgray">JIRA.12979373.1465999466000.5338.1466000829453@Atlassian.JIRA</li> </ul> </div> </body> </html>