[ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149992#comment-13149992 ]
Dave Meikle commented on TIKA-663:
----------------------------------

Hi,

I'm not sure how you are using Tika, but I would have expected you to be hit by this from Tika 0.5 (the change to TagSoup for HTML parsing), as the TagSoup parser does not pick up the <% and %> tags within the JSP, so that content does not appear in the output. Tika 0.8 also used TagSoup, so I would have thought you would see the same behaviour in that version?

I suspect we will want to add a new mime-type entry for JSP files to pass them to the plain text parser, as the existing glob mapping is being beaten by the magic mapping for HTML in these files. Something like this should do the trick[1]:

<mime-info>
  ...
  <mime-type type="application/x-httpd-jsp">
    <sub-class-of type="text/plain"/>
    <magic priority="50">
      <match value="&lt;%@" type="string" offset="0"/>
    </magic>
    <glob pattern="*.jsp"/>
  </mime-type>
  ...
</mime-info>

That will remove some of the other metadata extraction done by the {{HtmlParser}} (title, etc.) for anyone else who has been using Tika to parse JSP files before 0.5, but it will give them the content and be correct for the original intent based on the glob in the existing mime-types.xml.

Not sure if anyone has any objections to this? If not, I will make the change - as I would expect Tika to treat a JSP file as text, so that the script contents are extracted as well.

Cheers,
Dave

[1] You can place the above XML in custom-mimetypes.xml within the package org.apache.tika.mime on your classpath to try this out.

> JSP files data extraction failed
> --------------------------------
>
>                 Key: TIKA-663
>                 URL: https://issues.apache.org/jira/browse/TIKA-663
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: Windows, Java 6
>            Reporter: samraj
>         Attachments: File_1.jsp, File_2.jsp, File_3.jsp
>
>
> We have worked with Tika extraction. In 0.8, JSP file contents were extracted
> well, but in 0.9 the same files are not extracted well.
> Pls give the solution
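
For anyone wanting to reason about what the proposed entry would match, here is a toy sketch in plain Java of its two rules taken in isolation: the string magic "<%@" at offset 0 and the *.jsp glob. This is not Tika's detection code; the class name JspSniffer, its methods and the fallback type are hypothetical, and how Tika weighs magic against glob matches across different types is deliberately not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Toy model of the two rules in the proposed mime-type entry.
// Hypothetical helper for illustration only; not part of Tika.
public class JspSniffer {
    public static final String JSP_TYPE = "application/x-httpd-jsp";
    // Assumed fallback for this sketch when neither rule matches.
    public static final String FALLBACK_TYPE = "application/octet-stream";

    // Mirrors <match value="<%@" type="string" offset="0"/>.
    private static final byte[] JSP_MAGIC = "<%@".getBytes(StandardCharsets.US_ASCII);

    // True when the content begins with the magic bytes at offset 0.
    public static boolean startsWithMagic(byte[] content) {
        if (content.length < JSP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < JSP_MAGIC.length; i++) {
            if (content[i] != JSP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Mirrors <glob pattern="*.jsp"/> (case-insensitive here by choice).
    public static boolean matchesGlob(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".jsp");
    }

    // Reports the JSP type if either rule matches, otherwise the fallback.
    public static String detect(String fileName, byte[] content) {
        if (startsWithMagic(content) || matchesGlob(fileName)) {
            return JSP_TYPE;
        }
        return FALLBACK_TYPE;
    }

    public static void main(String[] args) {
        byte[] directive = "<%@ page language=\"java\" %>".getBytes(StandardCharsets.US_ASCII);
        byte[] markup = "<html><body><% out.print(1); %></body></html>"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithMagic(directive));
        System.out.println(startsWithMagic(markup));
        System.out.println(detect("page.txt", directive));
        System.out.println(detect("page.txt", markup));
    }
}
```

One thing the sketch makes visible: a JSP file that opens with HTML markup rather than a <%@ directive does not match the magic rule, so for such files the outcome would still depend on how the glob fares against the HTML magic.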