Re: How to find out the encoding and format of the content stored in the index?

yanky young Sun, 05 Apr 2009 10:50:32 -0700

Hi:

If you have a look at org.apache.nutch.parse.html.HtmlParser, you can see
that three properties are stored in the metadata:


Metadata.ORIGINAL_CHAR_ENCODING
Metadata.CHAR_ENCODING_FOR_CONVERSION
Response.CONTENT_TYPE

So I think you can just read these properties in your indexing plugin in the
following way:

encoding = parse.getData().getMeta(Metadata.ORIGINAL_CHAR_ENCODING);
convEncoding = parse.getData().getMeta(Metadata.CHAR_ENCODING_FOR_CONVERSION);
contentType = parse.getData().getMeta(Response.CONTENT_TYPE);

Then add them to the Lucene Document as fields.
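
For illustration, here is a minimal, self-contained sketch of that copy step.
It is not real Nutch code: plain Maps stand in for the parse metadata and for
the index document, the three key strings are assumed placeholders for the
Metadata/Response constants, and the field names "encoding", "convEncoding"
and "contentType" are hypothetical. In a real plugin you would read the values
with parse.getData().getMeta(...) inside your indexing filter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingFieldsSketch {
    // Assumed placeholder keys; in Nutch, use Metadata.ORIGINAL_CHAR_ENCODING,
    // Metadata.CHAR_ENCODING_FOR_CONVERSION and Response.CONTENT_TYPE instead.
    public static final String ORIGINAL_CHAR_ENCODING = "OriginalCharEncoding";
    public static final String CHAR_ENCODING_FOR_CONVERSION = "CharEncodingForConversion";
    public static final String CONTENT_TYPE = "Content-Type";

    /**
     * Copies the three metadata values into the document under hypothetical
     * field names. Values that are missing from the metadata are skipped.
     */
    public static Map<String, String> copyEncodingFields(Map<String, String> parseMeta,
                                                         Map<String, String> doc) {
        copy(parseMeta, ORIGINAL_CHAR_ENCODING, doc, "encoding");
        copy(parseMeta, CHAR_ENCODING_FOR_CONVERSION, doc, "convEncoding");
        copy(parseMeta, CONTENT_TYPE, doc, "contentType");
        return doc;
    }

    // Adds one field to the document only if the metadata has a value for it.
    private static void copy(Map<String, String> meta, String key,
                             Map<String, String> doc, String field) {
        String value = meta.get(key);
        if (value != null) {
            doc.put(field, value);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the metadata the parser would have recorded.
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put(ORIGINAL_CHAR_ENCODING, "windows-1252");
        parseMeta.put(CONTENT_TYPE, "text/html");

        Map<String, String> doc = copyEncodingFields(parseMeta, new LinkedHashMap<>());
        System.out.println(doc);
    }
}
```

The null check matters because not every parser records all three values, so
a field is only written when the parse step actually stored something.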

good luck

yanky


2009/4/6 dealmaker <[email protected]>

>
> Thanks.  Is there a similar thing for encoding?  I don't want it to
> re-detect the encoding again, for performance reasons.
>
>
> yanky young wrote:
> >
> > hi:
> >
> > There is an index-more plugin that indexes some information about content
> > type. You can have a look.
> >
> > 2009/4/5 dealmaker <[email protected]>
> >
> >>
> >> Hi,
> >>  I am trying to find out the encoding and format of the content stored in
> >> the index.  I modified the code in BasicIndexFilter.java to store the
> >> content, but I need to know the encoding of the stored content, and this
> >> information doesn't seem to be stored.  I also need to know whether it's
> >> html, pdf, rss, etc.  I have the following code, but I have to create a
> >> Content object, which needs the content type, which I also don't have.
> >> I just hard-code it as text/html, but I should not.  Please help.  Thanks.
> >>
> >>          try {
> >>            contentInOctets = bean.getContent (detail);
> >>          } catch (IOException e) {
> >>            if (LOG.isWarnEnabled())
> >>              LOG.warn("GetContent Error", e);
> >>
> >>          }
> >>
> >>          InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
> >>          Content content = new Content(sUrl, sUrl, contentInOctets, "text/html", new Metadata(), attConf);
> >>          detector.autoDetectClues(content, true);
> >>          detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
> >>          String encoding = detector.guessEncoding(content, defaultCharEncoding);
> >>
> >>          input.setEncoding(encoding);
> >> --
> >> View this message in context:
> >> http://www.nabble.com/How-to-find-out-the-encoding-and-format-of-the-content-stored-in-the-index--tp22890765p22890765.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
>
>
