Thanks. Is there a similar thing for encoding? I don't want it to
re-detect the encoding, for performance reasons.
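
To make the question concrete, here is a minimal sketch of what I would like to do on the read side, assuming the charset name could be saved next to the content at index time and read back as a plain string. It uses only the JDK; the class and method names are hypothetical and not part of Nutch, and how the name gets into the index is exactly the part I am asking about.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper: decode stored bytes with a charset name that was
// saved at index time, so no detection pass is needed when reading.
public class StoredEncodingExample {

    /** Resolve a stored charset name; use the fallback if it is missing or unknown. */
    public static Charset resolve(String storedName, Charset fallback) {
        if (storedName == null || storedName.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(storedName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // The stored value is not a charset this JVM knows about.
            return fallback;
        }
    }

    /** Decode the raw content bytes using the stored charset name. */
    public static String decode(byte[] contentInOctets, String storedName, Charset fallback) {
        return new String(contentInOctets, resolve(storedName, fallback));
    }

    public static void main(String[] args) {
        // Assumption: "ISO-8859-1" stands in for the value read from the index.
        byte[] contentInOctets = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String text = decode(contentInOctets, "ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(text.length() + " characters decoded");
    }
}
```

The fallback only covers documents indexed before the charset was stored; for everything else the detector would never have to run again.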


yanky young wrote:
> 
> Hi,
> 
> There is an index-more plugin that indexes some information about the
> content type. You can have a look.
> 
> 2009/4/5 dealmaker <[email protected]>
> 
>>
>> Hi,
>> I am trying to find out the encoding and format of the content stored in
>> the index. I modified the code in BasicIndexFilter.java to store the
>> content, but the index doesn't seem to store the content's encoding, which
>> I need to know. I also need to know whether it's HTML, PDF, RSS, etc. I
>> have the following code, but I have to create a Content object, which needs
>> the content type, and I don't have that either. I just hard-code it as
>> text/html, but I should not. Please help. Thanks.
>>
>>          try {
>>            contentInOctets = bean.getContent(detail);
>>          } catch (IOException e) {
>>            if (LOG.isWarnEnabled())
>>              LOG.warn("GetContent Error", e);
>>          }
>>
>>          InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>>          Content content = new Content(sUrl, sUrl, contentInOctets, "text/html", new Metadata(), attConf);
>>          detector.autoDetectClues(content, true);
>>          detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
>>          String encoding = detector.guessEncoding(content, defaultCharEncoding);
>>
>>          input.setEncoding(encoding);
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

