How to find out the encoding and format of the content stored in the index?

dealmaker Sat, 04 Apr 2009 22:55:07 -0700

Hi,
  I am trying to find out the encoding and format of the content stored in
the index.  I modified the code in BasicIndexFilter.java to store the
content, but the encoding does not seem to be stored along with it, and I
need to know it.  I also need to know whether the content is HTML, PDF, RSS,
etc.  I have the following code, but it has to create a Content object, which
requires the content type, and I don't have that either.  For now I hard-code
it as text/html, which I should not do.  Please help.  Thanks.


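          // Fetch the raw stored bytes for this hit.
          try {
            contentInOctets = bean.getContent(detail);
          } catch (IOException e) {
            // Note: on failure contentInOctets keeps its previous value
            // (possibly null), which the code below does not check for.
            if (LOG.isWarnEnabled()) {
              LOG.warn("GetContent Error", e);
            }
          }

          InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
          // The content type is hard-coded as "text/html" because the real
          // type is not known at this point.
          Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
              new Metadata(), attConf);
          detector.autoDetectClues(content, true);
          detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
          String encoding = detector.guessEncoding(content, defaultCharEncoding);

          input.setEncoding(encoding);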
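
As a rough illustration of what guessing the format and encoding from the
bytes themselves could look like, instead of hard-coding text/html, here is a
minimal sketch that uses only the JDK.  ContentSniffer and both of its
methods are hypothetical names and are not part of Nutch.  The checks are
deliberately simple (a PDF header, an <rss or <html tag near the start, and
a byte-order mark), so treat them as a starting point rather than a reliable
detector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class ContentSniffer {

    // Returns the charset name implied by a leading byte-order mark,
    // or null when the data does not start with a recognised BOM.
    static String sniffBomEncoding(byte[] data) {
        if (data == null) {
            return null;
        }
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    // Guesses a MIME type from the first bytes of the content.
    // Falls back to application/octet-stream when nothing matches.
    static String guessContentType(byte[] data) {
        if (data == null || data.length == 0) {
            return "application/octet-stream";
        }
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        int n = Math.min(data.length, 1024);
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("%pdf-")) {
            return "application/pdf";
        }
        if (head.contains("<rss")) {
            return "application/rss+xml";
        }
        if (head.contains("<!doctype html") || head.contains("<html")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "<!DOCTYPE html><html></html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessContentType(sample));
        System.out.println(sniffBomEncoding(sample));
    }
}
```

In the snippet above, the result of guessContentType(contentInOctets) could be
passed to the Content constructor in place of the literal "text/html", and a
non-null result from sniffBomEncoding could be added as one more clue.
Content stored as UTF-16 will not match the tag checks, because the bytes are
compared as single-byte characters.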
-- 
View this message in context: 
http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
Sent from the Nutch - User mailing list archive at Nabble.com.
