Thanks. Is there something similar for the encoding? For performance reasons, I don't want to re-detect the encoding every time.
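
To make the question concrete, here is roughly what I would like to end up with. This is only a sketch in plain Java and does not use any Nutch API: the class name, the "encoding" key and the helper methods are all made up for illustration, and I am assuming the detected charset name can be saved somewhere next to the stored content at index time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: none of these names come from Nutch.
public class StoredEncodingSketch {

    // Hypothetical key under which the charset name is saved with the document.
    static final String ENCODING_KEY = "encoding";

    // Index time: detection has already run once; keep its result in the metadata.
    static Map<String, String> metadataFor(String detectedEncoding) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(ENCODING_KEY, detectedEncoding);
        return metadata;
    }

    // Read time: decode with the stored charset name instead of detecting again.
    // Falls back to the supplied default when no encoding was stored.
    static String decode(byte[] contentInOctets,
                         Map<String, String> metadata,
                         String defaultCharEncoding) {
        String name = metadata.getOrDefault(ENCODING_KEY, defaultCharEncoding);
        return new String(contentInOctets, Charset.forName(name));
    }

    public static void main(String[] args) {
        // Stand-in for the bytes returned by bean.getContent(detail).
        byte[] contentInOctets = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        Map<String, String> metadata = metadataFor("ISO-8859-1");
        System.out.println(decode(contentInOctets, metadata, "UTF-8"));
    }
}
```

In other words, the detector would run once when the page is indexed, and reading the content back would only need a lookup of the stored name.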

yanky young wrote:
>
> hi:
>
> there is a index-more plugin that index some information about content
> type.
> u can have a look.
>
> 2009/4/5 dealmaker <[email protected]>
>
>>
>> Hi,
>> I am trying to find out the encoding and format of the content stored in
>> the index. I modified the code in BasicIndexFilter.java to store the
>> content. But I need to know the encoding of the stored content which
>> doesn't seem to store this information. I also need to know whether it's
>> html, pdf, rss, etc. I have the following code, but I have to create
>> Content object which needs the content type which I don't also have, I
>> just hard code it text/html but I should not. Please help. Thanks.
>>
>>   try {
>>     contentInOctets = bean.getContent (detail);
>>   } catch (IOException e) {
>>     if (LOG.isWarnEnabled())
>>       LOG.warn("GetContent Error", e);
>>   }
>>
>>   InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>>   Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
>>                                 new Metadata(), attConf);
>>   detector.autoDetectClues(content, true);
>>   detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
>>   String encoding = detector.guessEncoding(content, defaultCharEncoding);
>>
>>   input.setEncoding(encoding);
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>

--
View this message in context: http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22895581.html
Sent from the Nutch - User mailing list archive at Nabble.com.
