Hi, there is an index-more plugin that indexes some information about the content type. You can have a look at it.
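Once a content type value is available from the index, it still has to be split into the format and the encoding. Below is a minimal sketch in plain Java, assuming the stored value looks like an HTTP Content-Type header such as "text/html; charset=UTF-8". The class name ContentTypeInfo and both method names are hypothetical, not part of Nutch, and the exact field the plugin writes should be checked in its source.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: splits a Content-Type style
// string into its media type and its charset parameter.
public class ContentTypeInfo {

    /** Returns the bare media type, e.g. "text/html", or null if none is given. */
    public static String mimeType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Everything before the first ';' is the media type.
        String head = contentType.split(";", 2)[0].trim();
        return head.isEmpty() ? null : head.toLowerCase(Locale.ROOT);
    }

    /** Returns the charset parameter if present and supported, else the fallback. */
    public static String charset(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                                   .trim()
                                   .replace("\"", "");
                try {
                    if (Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: ignore it and keep looking.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Assumed example value; a real one would come from the index.
        String stored = "text/html; charset=UTF-8";
        System.out.println(mimeType(stored));                // prints "text/html"
        System.out.println(charset(stored, "windows-1252")); // prints "UTF-8"
    }
}
```

If the stored value carries no charset parameter, the fallback is returned, so byte-level sniffing like in your code would still be needed for those documents.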

2009/4/5 dealmaker <[email protected]>

> Hi,
> I am trying to find out the encoding and format of the content stored in
> the index. I modified the code in BasicIndexFilter.java to store the
> content. But I need to know the encoding of the stored content which
> doesn't seem to store this information. I also need to know whether it's
> html, pdf, rss, etc. I have the following code, but I have to create
> Content object which needs the content type which I don't also have, I just
> hard code it text/html but I should not. Please help. Thanks.
>
> try {
>     contentInOctets = bean.getContent (detail);
> } catch (IOException e) {
>     if (LOG.isWarnEnabled())
>         LOG.warn("GetContent Error", e);
> }
>
> InputSource input = new InputSource(new
>     ByteArrayInputStream(contentInOctets));
> Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
>     new Metadata(), attConf);
> detector.autoDetectClues(content, true);
> detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
> String encoding = detector.guessEncoding(content, defaultCharEncoding);
>
> input.setEncoding(encoding);
> --
> View this message in context:
> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
