Re: How to find out the encoding and format of the content stored in the index?

yanky young Sat, 04 Apr 2009 23:28:54 -0700

Hi,

There is an index-more plugin that indexes some information about the
content type. You can have a look at it.
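
If the indexed content type is not available for a document, the format
can be roughly guessed from the first bytes of the stored content. The
following is only a minimal sketch in plain Java, not Nutch code: the class
name FormatSniffer, the 1024-byte window and the three checks are
assumptions made for illustration, and real detection needs many more
cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class FormatSniffer {
    // Hypothetical window size: only the start of the content is inspected.
    private static final int WINDOW = 1024;

    /** Returns a guessed MIME type, or application/octet-stream if unknown. */
    public static String sniff(byte[] content) {
        if (content == null || content.length == 0) {
            return "application/octet-stream";
        }
        int length = Math.min(content.length, WINDOW);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("%pdf-")) {
            return "application/pdf";       // PDF files start with "%PDF-"
        }
        if (head.contains("<rss")) {
            return "application/rss+xml";   // RSS root element
        }
        if (head.contains("<html") || head.contains("<!doctype html")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] page = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page));
    }
}
```

A value guessed this way could replace the hard-coded "text/html" when the
Content object is built, but the type recorded at fetch or index time is
more reliable whenever it is available.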


2009/4/5 dealmaker <[email protected]>

>
> Hi,
>  I am trying to find out the encoding and format of the content stored in
> the index.  I modified the code in BasicIndexFilter.java to store the
> content, but the index does not seem to store the content's encoding,
> which I need to know.  I also need to know whether the content is HTML,
> PDF, RSS, etc.  I have the following code, but it has to create a Content
> object, which needs the content type, and I don't have that either.  For
> now I hard-code it as text/html, but I should not.  Please help.  Thanks.
>
>          try {
>            contentInOctets = bean.getContent(detail);
>          } catch (IOException e) {
>            if (LOG.isWarnEnabled())
>              LOG.warn("GetContent Error", e);
>          }
>
>          InputSource input =
>              new InputSource(new ByteArrayInputStream(contentInOctets));
>          Content content = new Content(sUrl, sUrl, contentInOctets,
>              "text/html", new Metadata(), attConf);
>          detector.autoDetectClues(content, true);
>          detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
>          String encoding =
>              detector.guessEncoding(content, defaultCharEncoding);
>
>          input.setEncoding(encoding);
> --
> View this message in context:
> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
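
The quoted snippet calls a sniffCharacterEncoding helper whose body is not
shown. As a rough illustration of what such a helper might do, here is a
self-contained sketch in plain Java that looks for a charset declaration in
a <meta> tag near the start of the bytes. The class name CharsetSniffer, the
2000-byte window, the regular expression and the fallback parameter are all
assumptions, not the code from the original message or from Nutch.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Hypothetical window size: only the start of the document is inspected.
    private static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_\\-:.]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or the fallback if none is found. */
    public static String sniff(byte[] content, String fallback) {
        if (content == null) {
            return fallback;
        }
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail
        // and ASCII markup such as "<meta" stays readable.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(html, "windows-1252")); // prints "utf-8"
    }
}
```

This only helps for HTML that declares its own charset. For PDF or other
binary formats the declared value does not exist, which is one more reason
to know the content type before guessing the encoding.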
