[
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071942#comment-14071942
]
Tyler Palsulich edited comment on TIKA-1373 at 7/23/14 4:52 PM:
The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}.
codeAsHtml is formatted by jhighlight, a syntax highlighter. So it _looks_
like --text isn't returning the text, but it's just that the text content
is HTML markup.
I'm not sure how we can turn the jhighlight HTML tags into SAX events. Tika's
HtmlParser? Something like
{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
// Parse the highlighted HTML (not the raw source) so its tags become SAX events.
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes(charset.name())),
        xhtml, metadata, context);
{code}
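To make the distinction concrete, here is a minimal sketch that uses only the JDK's SAX classes, not Tika or jhighlight. It assumes the highlighted fragment is well-formed XML, which real jhighlight output may not be (for example, if it contains HTML-only entities). The class name, the {{textOf}} helper, and the sample markup are hypothetical, chosen for illustration. Once the markup goes through a parser, the tags arrive as element events and only the code itself reaches {{characters()}}:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MarkupToSaxSketch {

    /** Collects only character data, much as a text-extracting handler would. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses a well-formed markup fragment and returns its text content. */
    static String textOf(String fragment) throws Exception {
        TextCollector collector = new TextCollector();
        // Wrap the fragment in a single root element so it forms a complete document.
        String doc = "<div>" + fragment + "</div>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc)), collector);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for highlighter output; the real markup may differ.
        String highlighted = "<span class=\"kw\">public</span> "
                + "<span class=\"kw\">class</span> HelloWorld {}";
        System.out.println(textOf(highlighted));
    }
}
```

If the highlighted string were instead passed straight to {{characters()}} as a single text node, the tags themselves would be part of the extracted text.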
AutoDetectParser extracts no text when SourceCodeParser is selected
---
Key: TIKA-1373
URL: https://issues.apache.org/jira/browse/TIKA-1373
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña
When using the AutoDetectParser in Java code and the SourceCodeParser is
selected (e.g. for Java files), the handler gets no text.
I have this test program:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Content type hint that leads to SourceCodeParser being selected
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
It returns (using the SourceCodeParser):
{code} Text extracted: {code}
But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Same input, but hinted as plain text
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
The Text Parser is used and I get:
{code} Text extracted: public class HelloWorld {} {code}
I have also tested this command:
{code}
java -jar tika-app-1.5.jar -t D:\text.java
(no text)
{code}