[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876821#comment-13876821 ]
Ken Krugler commented on TIKA-1224:
-----------------------------------
For many languages, parsing needs to be fuzzy (e.g. for C code, without knowing
the values for conditional compilation, it's impossible to accurately parse
many source files). One quick & dirty approach is to use syntax highlighters,
though the deeper question is what exactly to extract as the text - i.e. what
would Tika return that's different from the (original) text?
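
To make the fuzzy approach concrete, here is a minimal sketch of one possible answer to "what to extract": pull out only comment text and string-literal contents with a small hand-written scanner, never attempting a real parse. The class name FuzzySourceText and the choice of what counts as "text" are assumptions made for illustration; this is not how Tika or jhighlight work. The sample input in main has braces that only balance once the preprocessor has picked a branch, which is exactly the case a strict C parser cannot handle without knowing the conditional compilation values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: a tolerant scanner for C-like
// source (C, Java, Groovy) that collects comments and string literals.
final class FuzzySourceText {

    static List<String> extract(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (src.startsWith("//", i)) {
                // Line comment: runs to the end of the line.
                int end = src.indexOf('\n', i);
                if (end < 0) end = n;
                add(out, src.substring(i + 2, end));
                i = end;
            } else if (src.startsWith("/*", i)) {
                // Block comment: if unterminated, take the rest of the input.
                int end = src.indexOf("*/", i + 2);
                add(out, src.substring(i + 2, end < 0 ? n : end));
                i = end < 0 ? n : end + 2;
            } else if (c == '"' || c == '\'') {
                // Literal: stop at the closing quote or at end of line, so a
                // stray quote (e.g. in "#error don't") cannot swallow the file.
                StringBuilder sb = new StringBuilder();
                int j = i + 1;
                while (j < n && src.charAt(j) != c && src.charAt(j) != '\n') {
                    if (src.charAt(j) == '\\' && j + 1 < n) {
                        sb.append(src.charAt(j)); // keep escapes verbatim
                        j++;
                    }
                    sb.append(src.charAt(j));
                    j++;
                }
                if (c == '"') add(out, sb.toString()); // skip char literals
                i = (j < n && src.charAt(j) == c) ? j + 1 : j;
            } else {
                i++; // code, preprocessor lines, unbalanced braces: ignored
            }
        }
        return out;
    }

    private static void add(List<String> out, String s) {
        String t = s.trim();
        if (!t.isEmpty()) out.add(t);
    }

    public static void main(String[] args) {
        String src =
            "#ifdef USE_FOO\n" +
            "if (foo(x)) { /* fast path */\n" +
            "#else\n" +
            "if (bar(x)) { // slow path\n" +
            "#endif\n" +
            "    puts(\"hello, world\");\n" +
            "}\n";
        for (String s : extract(src)) {
            System.out.println(s);
        }
    }
}
```

The scanner returns strictly less than the original text, which is one way to make the output differ from the input; whether comments and literals are the right subset is the open question raised above.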
> Adding Source code (Java, Groovy, C) parser
> -------------------------------------------
>
> Key: TIKA-1224
> URL: https://issues.apache.org/jira/browse/TIKA-1224
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.5
> Reporter: Hong-Thai Nguyen
> Priority: Minor
>
> We can parse some source code file formats:
> text/x-java-source
> text/x-groovy
> text/x-c
> For rendering the code as HTML, we can use jhighlight:
> http://www.ohloh.net/p/jhighlight
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)