[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen edited comment on TIKA-1373 at 7/23/14 1:42 PM:
-

Can you format your description with the {noformat}{code}{noformat} annotation? 
Also, if I understand correctly, the output of the first section is empty?


was (Author: thaichat04):
Can you format your description with {code} annotation and if I understand well 
the output of 1st section is empty ?

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in Java code, and the SourceCodeParser is 
 selected (e.g. for Java files), the handler gets no text.
 I have this test program:
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 It returns (using the SourceCodeParser): 
  Text extracted: 
 But when I use this code:
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 The Text Parser is used and I get:
  Text extracted: public class HelloWorld {}
 I have also tested this command: 
  java -jar tika-app-1.5.jar -t D:\text.java
   (no text)
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071942#comment-14071942
 ] 

Tyler Palsulich edited comment on TIKA-1373 at 7/23/14 4:52 PM:


The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. 
codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it _looks_ 
like the --text isn't returning the text, but it's just that the text content 
is HTML.

I'm not sure how we can turn the jhighlight HTML tags into SAX events. Maybe 
with Tika's HtmlParser? Something like:
{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
HtmlParser htmlParser = new HtmlParser();
// Parse the highlighted HTML (not the raw source) so its tags become SAX events
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes()), xhtml,
        metadata, context);
{code}
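
To illustrate the distinction above, here is a minimal sketch that uses only the JDK SAX parser, not Tika or jhighlight. The class name, the {{TextCollector}} helper and the sample markup are hypothetical, and the markup is assumed to be well-formed XML (real highlighter output may need a lenient HTML parser instead). It compares handing the markup to a handler as one characters() event with parsing it first, so that the tags become element events and only the source text arrives as character data.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HighlightTextSketch {

    /** Collects character data only and ignores element events. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses well-formed markup into SAX events and returns the character data. */
    static String textOf(String markup) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), collector);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical highlighter output; not actual jhighlight markup.
        String codeAsHtml = "<p><span class=\"kw\">public</span> "
                + "<span class=\"kw\">class</span> HelloWorld {}</p>";

        // Case 1: the whole markup string is delivered as a single characters() event,
        // so the handler sees the tags as literal text.
        TextCollector raw = new TextCollector();
        raw.characters(codeAsHtml.toCharArray(), 0, codeAsHtml.length());
        System.out.println("As one text event: " + raw.text);

        // Case 2: the markup is parsed first, so only the source text is collected.
        System.out.println("After parsing:     " + textOf(codeAsHtml));
    }
}
```

In the second case the handler receives the plain source text, which is the effect the HtmlParser idea above is aiming for.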


was (Author: tpalsulich):
The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. 
codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it  _looks_ 
like the --text is returning the text, but it's just that the text content is 
html.

I'm not sure how we can turn the jhighlight html tags into SAX events. Tika 
HtmlParser? Something like
{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
HtmlParser htmlParser = new HtmlParser();
// Parse the highlighted HTML (not the raw source) so its tags become SAX events
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes()), xhtml,
        metadata, context);
{code}

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in Java code, and the SourceCodeParser is 
 selected (e.g. for Java files), the handler gets no text.
 I have this test program:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 It returns (using the SourceCodeParser): 
 {code}  Text extracted: {code}
 But when I use this code:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 The Text Parser is used and I get:
 {code}  Text extracted: public class HelloWorld {} {code}
 I have also tested this command: 
 {code}
  java -jar tika-app-1.5.jar -t D:\text.java
   (no text)
 {code}


