[jira] [Commented] (TIKA-663) JSP files data extraction failed

Dave Meikle (Commented) (JIRA) Mon, 14 Nov 2011 14:12:14 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149992#comment-13149992
 ]


Dave Meikle commented on TIKA-663:
----------------------------------

Hi,

Not sure how you are using Tika, but I would have thought you would have been 
hit by this from Tika 0.5 onwards (the change to TagSoup for HTML parsing), as 
the TagSoup parser does not pick up the <% and %> tags within the JSP, so the 
content does not appear in the output.

Tika 0.8 also used TagSoup, so I would have thought you would see the same 
behaviour in that version too?

I suspect we will want to add a new mime-type entry for JSP files to pass them 
to the plain text parser, as the existing glob mapping is being beaten by the 
magic mapping for HTML in these files.  Something like this should do the 
trick[1]:

<mime-info>
...
  <mime-type type="application/x-httpd-jsp">
    <sub-class-of type="text/plain"/>
    <magic priority="50">
      <match value="&lt;%@" type="string" offset="0"/>
    </magic>
    <glob pattern="*.jsp"/>
  </mime-type>
  ...
</mime-info>

That will remove some of the other metadata extraction from the {{HtmlParser}} 
for anyone else who has been using Tika to parse JSP files before 0.5 (title, 
etc.), but it will give them the content and be correct for the original intent 
based on the glob in the existing mime-types.xml.

Not sure if anyone has any objections to this?  If not, I will make the change 
- as I would expect Tika to treat a JSP file as text to get the script contents 
as well.
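
To make the two rules in the proposed entry concrete, here is a rough Java 
sketch of what they test: the magic rule looks for the string <%@ at offset 0, 
and the glob rule looks at the file name. This is only an illustration using 
the standard library. The class and method names are made up for the example, 
it is not how Tika implements detection, and it does not model how Tika ranks 
competing magic and glob matches.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical names, not part of Tika.
class JspTypeSketch {

    // The string the proposed <match> element declares, expected at offset 0.
    static final String MAGIC = "<%@";

    // True when the first bytes of the file are exactly "<%@".
    public static boolean matchesJspMagic(byte[] head) {
        byte[] magic = MAGIC.getBytes(StandardCharsets.US_ASCII);
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Simplified stand-in for the "*.jsp" glob: a plain, case-sensitive suffix check.
    public static boolean matchesJspGlob(String fileName) {
        return fileName.endsWith(".jsp");
    }

    public static void main(String[] args) {
        byte[] directiveFirst = "<%@ page language=\"java\" %>".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlFirst = "<html><body><% out.println(1); %></body></html>".getBytes(StandardCharsets.US_ASCII);

        System.out.println("directive first, magic: " + matchesJspMagic(directiveFirst));
        System.out.println("html first, magic:      " + matchesJspMagic(htmlFirst));
        System.out.println("File_1.jsp, glob:       " + matchesJspGlob("File_1.jsp"));
    }
}
```

Note that a JSP file which opens with HTML markup rather than a directive would 
not satisfy the magic rule in this sketch, only the glob rule.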

Cheers,
Dave

[1] You can place the above XML in custom-mimetypes.xml within the package 
org.apache.tika.mime on your classpath to try this out.
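
If you want to confirm the file is actually visible before testing, a small 
standard-library check like the one below may help. It is only a sketch: the 
class name is made up, and it assumes the file is looked up as a classpath 
resource under the package path, using the class loader of the check itself, 
which may differ from the loader Tika uses in your setup.

```java
import java.net.URL;

// Illustration only: hypothetical helper, not part of Tika.
class CustomMimeTypesCheck {

    // Package and file name taken from the note above.
    static final String PACKAGE = "org.apache.tika.mime";
    static final String FILE = "custom-mimetypes.xml";

    // Converts the package name to a classpath resource path.
    public static String resourcePath() {
        return PACKAGE.replace('.', '/') + "/" + FILE;
    }

    // Returns the resource URL, or null when the loader cannot see the file.
    public static URL locate(ClassLoader loader) {
        return loader.getResource(resourcePath());
    }

    public static void main(String[] args) {
        URL url = locate(CustomMimeTypesCheck.class.getClassLoader());
        if (url == null) {
            System.out.println(resourcePath() + " is not on the classpath");
        } else {
            System.out.println("Found " + url);
        }
    }
}
```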


                
> JSP files data extraction failed
> --------------------------------
>
>                 Key: TIKA-663
>                 URL: https://issues.apache.org/jira/browse/TIKA-663
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: Windows, Java 6
>            Reporter: samraj
>         Attachments: File_1.jsp, File_2.jsp, File_3.jsp
>
>
> We have been using Tika for extraction. In 0.8, JSP file contents were 
> extracted well, but in 0.9 the same files are not extracted well. Please 
> provide a solution.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
