Hi,

If you look at org.apache.nutch.parse.html.HtmlParser, you can see that several properties are stored in the metadata:

    Metadata.ORIGINAL_CHAR_ENCODING
    Metadata.CHAR_ENCODING_FOR_CONVERSION
    Response.CONTENT_TYPE

So I think you can just read these properties in your index plugin in the following way:

    encoding = parse.getData().getMeta(Metadata.ORIGINAL_CHAR_ENCODING);
    convEncoding = parse.getData().getMeta(Metadata.CHAR_ENCODING_FOR_CONVERSION);
    contentType = parse.getData().getMeta(Response.CONTENT_TYPE);

and then add them to the Lucene Document with the addField method.

Good luck,
yanky

2009/4/6 dealmaker <[email protected]>

> Thanks. Is there a similar thing for encoding? I don't want it to
> re-detect the encoding, for performance reasons.
>
> yanky young wrote:
> >
> > Hi,
> >
> > There is an index-more plugin that indexes some information about the
> > content type. You can have a look.
> >
> > 2009/4/5 dealmaker <[email protected]>
> >
> >> Hi,
> >> I am trying to find out the encoding and format of the content stored in
> >> the index. I modified the code in BasicIndexFilter.java to store the
> >> content, but I need to know the encoding of the stored content, and the
> >> index doesn't seem to store this information. I also need to know whether
> >> it's html, pdf, rss, etc. I have the following code, but I have to create
> >> a Content object, which needs the content type, which I also don't have.
> >> I just hard-code it as text/html, but I should not. Please help. Thanks.
> >>
> >>     try {
> >>         contentInOctets = bean.getContent(detail);
> >>     } catch (IOException e) {
> >>         if (LOG.isWarnEnabled())
> >>             LOG.warn("GetContent Error", e);
> >>     }
> >>
> >>     InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
> >>     Content content = new Content(sUrl, sUrl, contentInOctets, "text/html",
> >>             new Metadata(), attConf);
> >>     detector.autoDetectClues(content, true);
> >>     detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
> >>     String encoding = detector.guessEncoding(content, defaultCharEncoding);
> >>
> >>     input.setEncoding(encoding);
> >> --
> >> View this message in context:
> >> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >
> --
> View this message in context:
> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22895581.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
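
P.S. In case it helps, here is a rough, self-contained sketch of the idea from my reply at the top: read the three metadata values and copy whichever are present into index fields. It is only an illustration, not the real plugin code. Plain maps stand in for the Nutch parse metadata and for the Lucene Document so that it runs without Nutch on the classpath. The class name, the method names, the field names ("encoding", "convEncoding", "contentType") and the literal key strings are my assumptions; in a real plugin you would use the Metadata and Response constants directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch: plain maps play the role of Nutch's parse metadata and of
// the Lucene Document. Nothing here is the actual Nutch or Lucene API.
class EncodingFieldsSketch {

    // Assumed string values behind Metadata.ORIGINAL_CHAR_ENCODING,
    // Metadata.CHAR_ENCODING_FOR_CONVERSION and Response.CONTENT_TYPE.
    static final String ORIGINAL_CHAR_ENCODING = "OriginalCharEncoding";
    static final String CHAR_ENCODING_FOR_CONVERSION = "CharEncodingForConversion";
    static final String CONTENT_TYPE = "Content-Type";

    // Copies the three properties into a new "document" map, skipping any
    // property that the parser did not record. Field names are hypothetical.
    static Map<String, String> toFields(Map<String, String> parseMeta) {
        Map<String, String> doc = new LinkedHashMap<>();
        copy(parseMeta, ORIGINAL_CHAR_ENCODING, doc, "encoding");
        copy(parseMeta, CHAR_ENCODING_FOR_CONVERSION, doc, "convEncoding");
        copy(parseMeta, CONTENT_TYPE, doc, "contentType");
        return doc;
    }

    // Adds the field only when the metadata value is present.
    private static void copy(Map<String, String> from, String key,
                             Map<String, String> to, String field) {
        String value = from.get(key);
        if (value != null) {
            to.put(field, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put(ORIGINAL_CHAR_ENCODING, "windows-1252");
        meta.put(CONTENT_TYPE, "text/html");
        // Prints the fields that would be added to the index document.
        System.out.println(toFields(meta));
    }
}
```

The null check matters because not every parser sets the encoding properties, so a non-HTML document may only carry the content type.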
