Hi,
I am trying to find out the encoding and format of the content stored in
the index. I modified the code in BasicIndexFilter.java to store the
content, but the index doesn't seem to store the encoding of that content,
which I need to know. I also need to know whether it's HTML, PDF, RSS, etc.
I have the following code, but I have to create a Content object, which
needs the content type, and I don't have that either. For now I just
hard-code it as text/html, but I shouldn't. Please help. Thanks.
try {
    contentInOctets = bean.getContent(detail);
} catch (IOException e) {
    if (LOG.isWarnEnabled())
        LOG.warn("GetContent Error", e);
}
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
    new Metadata(), attConf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);
input.setEncoding(encoding);
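
To make the "sniffed" clue concrete, here is a minimal, self-contained sketch of the kind of byte-level sniffing a helper like sniffCharacterEncoding could perform, using only the JDK. The class and method names (CharsetSniffer, sniffCharset) and the 2000-byte window are assumptions for illustration, not Nutch's actual implementation. It only looks for a charset declared in an HTML meta tag, so it says nothing about PDF or RSS content.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Assumed limit: only the start of the document is inspected.
    private static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta> tag, covering both
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    // and <meta charset="utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    public static String sniffCharset(byte[] content) {
        if (content == null) {
            return null;
        }
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 maps every byte to one char, so decoding never fails
        // and ASCII markup stays readable whatever the real encoding is.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniffCharset(html)); // prints UTF-8
    }
}
```

A null result here would mean no charset was declared in the first bytes, in which case the default encoding would still have to be used.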
--
View this message in context:
http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
Sent from the Nutch - User mailing list archive at Nabble.com.