[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-04-24 Thread Benoit Moreau (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979379#comment-13979379
 ] 

Benoit Moreau commented on TIKA-1224:
-

Stepping through in the debugger, I see that Tika uses 
org.apache.tika.SourceCodeParser for the x-java-source mime type. It strips 
all line endings (why? a mistake? readLine() does not return the trailing \n 
and/or \r), then passes the result to JHighlight. The JHighlight result (the 
entire HTML document) is then passed as the argument to the characters() 
method of the ContentHandler.
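
The readLine() behaviour described above can be reproduced with the JDK alone. The following is a minimal sketch, not the actual SourceCodeParser code; the class and method names are hypothetical. It shows that concatenating the strings returned by BufferedReader.readLine() drops every line terminator, so a closing brace that follows a // comment ends up on the comment's line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; it does not come from the Tika code base.
class ReadLineSketch {

    /** Joins the lines exactly as readLine() returns them: terminators are lost. */
    static String joinDroppingTerminators(String source) {
        return join(source, "");
    }

    /** Re-appends a '\n' after every line so the line structure survives. */
    static String joinKeepingTerminators(String source) {
        return join(source, "\n");
    }

    private static String join(String source, String separator) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
            String line;
            // readLine() returns the line content without its \n, \r or \r\n.
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(separator);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String src = "public class Test {\n// Class Test\n}\n";
        // prints: public class Test {// Class Test}
        System.out.println(joinDroppingTerminators(src));
        System.out.print(joinKeepingTerminators(src));
    }
}
```

With the terminators dropped, everything after the // is part of the comment as far as a highlighter can tell, which would also explain a final } being coloured as a comment.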

I have only just started with Tika, but I don't think that is right.

 Adding Source code (Java, Groovy, C) parser
 ---

 Key: TIKA-1224
 URL: https://issues.apache.org/jira/browse/TIKA-1224
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

 We can parse some source code file formats:
 text/x-java-source
 text/x-groovy
 text/x-c
 For HTML rendering of the code, we can use jhighlight: 
 http://www.ohloh.net/p/jhighlight



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979614#comment-13979614
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Thanks [~ben.12] for the feedback.
For the line-ending problem in the output, I created a new issue: TIKA-1279.
For the -t option in TikaCLI, the mime type of a Java file is ambiguous. It 
could be text/plain (in which case TxtParser is used and returns the original 
text as is) or x-java-source (in which case SourceCodeParser is used).

For the -h option, the output normally looks something like this:
{code}
Author: Hong-Thai.Nguyen
Content-Encoding: windows-1252
Content-Length: 4899
Content-Type: text/x-java-source
LoC: 133
creator: Hong-Thai.Nguyen
dc:creator: Hong-Thai.Nguyen
meta:author: Hong-Thai.Nguyen
resourceName: SourceCodeParser.java
{code}
The creator is taken from the '@author' tag in the Javadoc.
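
To illustrate the kind of metadata listed above, here is a minimal sketch of how an author name and a line count could be pulled from a source file with the JDK alone. It is not the SourceCodeParser implementation: the class and method names are hypothetical, the regular expression is an assumption, and counting physical lines is only one possible reading of "LoC".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class SourceMetadataSketch {

    // Assumed pattern: "@author", then spaces or tabs, then the rest of that line.
    private static final Pattern AUTHOR =
            Pattern.compile("@author[ \\t]+(\\S.*?)[ \\t]*$", Pattern.MULTILINE);

    /** Returns the first @author value found, or null if there is none. */
    static String author(String source) {
        Matcher m = AUTHOR.matcher(source);
        return m.find() ? m.group(1) : null;
    }

    /** Counts physical lines; blank lines and comment lines are included. */
    static int lineCount(String source) {
        if (source.isEmpty()) {
            return 0;
        }
        // \R matches any line break; trailing empty strings are dropped by split().
        return source.split("\\R").length;
    }

    public static void main(String[] args) {
        String src = "/**\n * Class Test.\n *\n * @author ben.12\n */\n"
                + "public class Test {\n// Class Test\n}\n";
        System.out.println("author = " + author(src));
        System.out.println("lines  = " + lineCount(src));
    }
}
```

Both methods need the original line breaks, so they would have to run on the raw text rather than on text whose line endings have already been stripped.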

This parser is quite generic (quick and dirty, as mentioned by [~kkrugler]) 
and simplistic. We could write a more dedicated Java source parser that 
extracts more metadata (members, attributes...). If you are interested in this 
kind of parser, please create a new issue; any investigation of this work is 
warmly welcome.

Regards,



[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-04-21 Thread Benoit Moreau (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975502#comment-13975502
 ] 

Benoit Moreau commented on TIKA-1224:
-

I'm disappointed because it does not work!

For example:

 java -jar tika-app-1.5.jar -t Test.java
Output is empty

 java -jar tika-app-1.5.jar -h Test.java
Output is strange

 java -jar tika-app-1.5.jar -T Test.java
Output is what I would expect for -h:
{code:xml}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="htt
p://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv=
"content-type" content="text/html; charset=ISO-8859-1" /> <meta name="genera
tor" content="JHighlight v1.0 (http://jhighlight.dev.java.net)" /> <title>Te
st.java</title> <link rel="Help" href="http://jhighlight.dev.java.net" />
  <style type="text/css"> .java_type { color: rgb(0,44,221); } .java_keyword { c
olor: rgb(0,0,0); font-weight: bold; } .java_javadoc_comment { color: rgb(147,14
7,147); background-color: rgb(247,247,247); font-style: italic; } .java_comment
{ color: rgb(147,147,147); background-color: rgb(247,247,247); } .java_operator
{ color: rgb(0,124,31); } .java_plain { color: rgb(0,0,0); } .java_literal { col
or: rgb(188,0,0); } code { color: rgb(0,0,0); font-family: monospace; font-size:
 12px; white-space: nowrap; } .java_javadoc_tag { color: rgb(147,147,147); backg
round-color: rgb(247,247,247); font-style: italic; font-weight: bold; } .java_se
parator { color: rgb(0,33,255); } h1 { font-family: sans-serif; font-size: 16pt;
 font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: sol
id 1px black; padding: 5px; text-align: center; } </style> </head> <body> <h
1>Test.java</h1><code><span class="java_javadoc_comment">/**&nbsp;*&nbsp;Class&n
bsp;Test.&nbsp;*&nbsp;*&nbsp;</span><span class="java_javadoc_tag">@author</span>
<span class="java_javadoc_comment">&nbsp;ben.12&nbsp;*/</span><span class="java
_keyword">public</span><span class="java_plain">&nbsp;</span><span class="java_k
eyword">class</span><span class="java_plain">&nbsp;</span><span class="java_type
">Test</span><span class="java_plain">&nbsp;</span><span class="java_separator">
{</span><span class="java_plain">&nbsp;&nbsp;</span><span class="java_comment">/
/&nbsp;Class&nbsp;Test}</span><br /> </code> </body> </html>
{code}
But everything is on a single line, the indentation is lost, and the file name 
appears at the beginning.
The author is not in the head meta tags.
The last } is highlighted as part of a comment.

\\
My input java file:
{code:title=Test.java}
/**
 * Class Test.
 *
 * @author ben.12
 */
public class Test {
// Class Test
}
{code}
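
One way markup can end up as visible text instead of structure is when a complete HTML string is passed to a SAX ContentHandler through characters(): a serializer then escapes it as character data. The sketch below demonstrates only that general SAX behaviour using the JDK's identity TransformerHandler. It does not use Tika, and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; it only exercises standard JAXP/SAX APIs.
class CharactersEscapingSketch {

    /** Sends the given string as character data inside a <body> element. */
    static String serializeAsText(String text) {
        try {
            // The JDK's default factory supports SAX; other factories may not.
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer()
                   .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "body", "body", new AttributesImpl());
            char[] chars = text.toCharArray();
            // Everything passed here is treated as text, never as tags.
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "body", "body");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The angle brackets of <code> come out escaped, not as an element.
        System.out.println(serializeAsText("<code>int x;</code>"));
    }
}
```

To get real elements in the output, a handler has to receive startElement()/endElement() events for each tag rather than one pre-rendered string.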



[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-04-21 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975519#comment-13975519
 ] 

Nick Burch commented on TIKA-1224:
--

Benoit - Does Tika correctly detect your files? The right parser won't kick in 
if Tika is confused about the mime type.



[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Committed in revision 1563902.



[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-01-21 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877343#comment-13877343
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


I agree that deeply parsing each language is not simple. This work (already 
done) just provides an HTML rendering of the source and some metadata where 
possible (such as author, version...), extracted from the Javadoc comment, 
plus other potentially interesting values such as LoC. When we need more 
detailed results for a language, we must implement a dedicated parser.
This parser is useful in search applications.



[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-01-20 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876821#comment-13876821
 ] 

Ken Krugler commented on TIKA-1224:
---

For many languages, parsing needs to be fuzzy (e.g. for C code, without knowing 
the values for conditional compilation, it's impossible to accurately parse 
many source files). One quick & dirty approach is to use syntax highlighters, 
though the deeper question is what exactly to extract as the text - i.e. what 
would Tika return that's different from the (original) text?
