[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979379#comment-13979379 ]

Benoit Moreau commented on TIKA-1224:
-------------------------------------

While debugging, I see that Tika uses org.apache.tika.parser.code.SourceCodeParser for the text/x-java-source MIME type.
The parser removes all line endings (why? Is this a mistake? readLine() does not return the trailing \n and/or \r), then passes the result to JHighlight.
The JHighlight result (the entire HTML document) is then passed as the argument to the characters() method of the ContentHandler.
I am just getting started with Tika, but I don't think that is right.

Adding Source code (Java, Groovy, C) parser
-------------------------------------------

Key: TIKA-1224
URL: https://issues.apache.org/jira/browse/TIKA-1224
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

We can parse some source code file formats:
text/x-java-source
text/x-groovy
text/x-c

For HTML rendering of the code, we can use JHighlight: http://www.ohloh.net/p/jhighlight

--
This message was sent by Atlassian JIRA (v6.2#6252)
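The concern about characters() in the comment above can be illustrated outside Tika. What follows is a minimal sketch, not Tika's code: it uses only the JDK's SAX serializer (TransformerHandler), and the class name CharactersEscapingDemo and the method serialize are hypothetical names chosen for illustration. It shows that markup handed to characters() is treated as plain text and escaped on output, so a complete HTML page passed that way would presumably come out as literal text rather than as elements.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; it does not reproduce SourceCodeParser.
class CharactersEscapingDemo {

    /** Wraps the given text in a body element using SAX events and returns the serialized XML. */
    static String serialize(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "body", "body", new AttributesImpl());
        // Everything passed here is character data, even if it looks like markup.
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "body", "body");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The tags inside the string are escaped in the output instead of becoming elements.
        System.out.println(serialize("<code>int x;</code>"));
    }
}
```

One alternative, under the same assumptions, would be to emit startElement()/endElement() events for the structure and reserve characters() for the source text itself.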
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979614#comment-13979614 ]

Hong-Thai Nguyen commented on TIKA-1224:
----------------------------------------

Thanks [~ben.12] for the feedback.

For the line-ending problem in the output, I created a new issue: TIKA-1279.

For the -t option in TikaCLI, the MIME type of a Java file is ambiguous. It could be text/plain (in which case TxtParser is used and returns the original text as is) or text/x-java-source (in which case SourceCodeParser is used).

For the -h option, the output normally looks something like this:
{code}
Author: Hong-Thai.Nguyen
Content-Encoding: windows-1252
Content-Length: 4899
Content-Type: text/x-java-source
LoC: 133
creator: Hong-Thai.Nguyen
dc:creator: Hong-Thai.Nguyen
meta:author: Hong-Thai.Nguyen
resourceName: SourceCodeParser.java
{code}
The creator comes from the 'author' tag in the Javadoc.

This parser is quite generic (quick and dirty, as mentioned by [~kkrugler]) and simplistic. We could write a more dedicated Java source parser that extracts more metadata (members, attributes...). If you are interested in this kind of parser, please create a new issue; an investigation of this work would be warmly welcome.

Regards,
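The kind of metadata listed above (an author taken from Javadoc, a line count) can be sketched in a few lines. This is only an illustration under stated assumptions, not the SourceCodeParser implementation: the class SourceMetadataSketch, the method extract, and the regular expression are hypothetical, and LoC is assumed here to count every line, including blank and comment lines, which may differ from what Tika counts.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the key names mirror the output shown above.
class SourceMetadataSketch {

    // Captures the text following the first "@author" tag, up to the end of that line.
    private static final Pattern AUTHOR_TAG =
            Pattern.compile("@author[ \\t]+(.+?)[ \\t]*$", Pattern.MULTILINE);

    static Map<String, String> extract(String source) {
        Map<String, String> metadata = new LinkedHashMap<>();

        Matcher matcher = AUTHOR_TAG.matcher(source);
        if (matcher.find()) {
            metadata.put("Author", matcher.group(1));
        }

        // Assumption: LoC is the raw number of lines, blank and comment lines included.
        long lines = new BufferedReader(new StringReader(source)).lines().count();
        metadata.put("LoC", Long.toString(lines));
        return metadata;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "/**",
                " * Class Test.",
                " *",
                " * @author ben.12",
                " */",
                "public class Test {",
                "  // Class Test",
                "}");
        System.out.println(extract(source)); // prints {Author=ben.12, LoC=8}
    }
}
```

A dedicated Java parser, as suggested above, would need a real grammar to extract members and attributes; a regular expression like this one only covers simple Javadoc tags.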
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975502#comment-13975502 ]

Benoit Moreau commented on TIKA-1224:
-------------------------------------

I'm disappointed because it does not work! For example:

java -jar tika-app-1.5.jar -t Test.java
Output is empty.

java -jar tika-app-1.5.jar -h Test.java
Output is strange.

java -jar tika-app-1.5.jar -T Test.java
Output is what I would expect for -h?
{code:xml}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
<meta name="generator" content="JHighlight v1.0 (http://jhighlight.dev.java.net)" />
<title>Test.java</title>
<link rel="Help" href="http://jhighlight.dev.java.net" />
<style type="text/css">
.java_type { color: rgb(0,44,221); }
.java_keyword { color: rgb(0,0,0); font-weight: bold; }
.java_javadoc_comment { color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic; }
.java_comment { color: rgb(147,147,147); background-color: rgb(247,247,247); }
.java_operator { color: rgb(0,124,31); }
.java_plain { color: rgb(0,0,0); }
.java_literal { color: rgb(188,0,0); }
code { color: rgb(0,0,0); font-family: monospace; font-size: 12px; white-space: nowrap; }
.java_javadoc_tag { color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic; font-weight: bold; }
.java_separator { color: rgb(0,33,255); }
h1 { font-family: sans-serif; font-size: 16pt; font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: solid 1px black; padding: 5px; text-align: center; }
</style>
</head>
<body>
<h1>Test.java</h1><code><span class="java_javadoc_comment">/**&nbsp;*&nbsp;Class&nbsp;Test.&nbsp;*&nbsp;*&nbsp;</span><span class="java_javadoc_tag">@author</span><span class="java_javadoc_comment">&nbsp;ben.12&nbsp;*/</span><span class="java_keyword">public</span><span class="java_plain">&nbsp;</span><span class="java_keyword">class</span><span class="java_plain">&nbsp;</span><span class="java_type">Test</span><span class="java_plain">&nbsp;</span><span class="java_separator">{</span><span class="java_plain">&nbsp;&nbsp;</span><span class="java_comment">//&nbsp;Class&nbsp;Test}</span><br />
</code>
</body>
</html>
{code}
But the code is all on a single line, the indentation is lost, and the file name appears at the beginning. The author is not in the head meta tags. The last } is highlighted as a comment.
\\
My input Java file:
{code:title=Test.java}
/**
 * Class Test.
 *
 * @author ben.12
 */
public class Test {
  // Class Test
}
{code}
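A later comment in this thread attributes the single-line output to lines being read with readLine() and joined without their terminators. The sketch below is not Tika's code; LineJoinDemo and its two methods are hypothetical names. It only demonstrates the standard behaviour of BufferedReader.readLine(), which strips the line terminator, and how that would make a trailing // comment swallow the closing brace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class illustrating the suspected cause, not the parser itself.
class LineJoinDemo {

    /** Lossy variant: readLine() drops the terminator and nothing restores it. */
    static String joinWithoutNewlines(String source) throws IOException {
        StringBuilder joined = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = reader.readLine()) != null) {
                joined.append(line);
            }
        }
        return joined.toString();
    }

    /** Line-preserving variant: re-append a newline after each line that was read. */
    static String joinKeepingNewlines(String source) throws IOException {
        StringBuilder joined = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = reader.readLine()) != null) {
                joined.append(line).append('\n');
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) throws IOException {
        String source = "public class Test {\n  // Class Test\n}\n";
        // prints: public class Test {  // Class Test}
        System.out.println(joinWithoutNewlines(source));
        // prints the three original lines
        System.out.print(joinKeepingNewlines(source));
    }
}
```

In the lossy variant the closing brace lands on the same line as the // comment, which matches the `Test}` comment span in the output above.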
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975519#comment-13975519 ]

Nick Burch commented on TIKA-1224:
----------------------------------

Benoit - Does Tika correctly detect your files? The right parser won't kick in if Tika is confused about the MIME type.
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491 ]

Hong-Thai Nguyen commented on TIKA-1224:
----------------------------------------

Committed in r1563902.
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877343#comment-13877343 ]

Hong-Thai Nguyen commented on TIKA-1224:
----------------------------------------

I agree that deeply parsing each language is not simple. This work (already done) just provides an HTML rendering of the source code, plus whatever metadata can be extracted from Javadoc comments (such as author and version) and other potentially interesting values such as LoC.
When we need more detailed results for a language, we must implement a dedicated parser.
This parser is useful in search applications.
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876821#comment-13876821 ]

Ken Krugler commented on TIKA-1224:
-----------------------------------

For many languages, parsing needs to be fuzzy (e.g. for C code, without knowing the values used for conditional compilation, it's impossible to accurately parse many source files).
One quick and dirty approach is to use syntax highlighters, though the deeper question is what exactly to extract as the text, i.e. what would Tika return that's different from the (original) text?