thank you again Lewis,
but do you think the strange content field could be my fault?
I ask because I disabled indexing for almost all fields.

this is my schema:

 <fields>
        <field name="id" type="string" stored="true" indexed="true"/>
        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>
        <!-- fields for index-basic plugin -->
        <field name="host" type="url" stored="false" indexed="false"/>
        <field name="site" type="string" stored="true" indexed="false"/>
        <field name="url" type="url" stored="true" indexed="false"
            required="true"/>
        <field name="content" type="text" stored="true" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="false"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="date" stored="true" indexed="false"/>
        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="false"
            multiValued="true"/>
        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="false"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="date" stored="false"
            indexed="false"/>
        <field name="date" type="date" stored="true" indexed="false"/>
        <!-- fields for languageidentifier plugin -->
        <field name="lang" type="string" stored="true" indexed="false"/>
        <!-- fields for subcollection plugin -->
        <field name="subcollection" type="string" stored="true"
            indexed="false" multiValued="true"/>
        <!-- fields for feed plugin (tag is also used by
            microformats-reltag) -->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true"
            multiValued="false"/>
        <field name="feed" type="string" stored="true" indexed="false"/>
        <field name="publishedDate" type="date" stored="true"
            indexed="false"/>
        <field name="updatedDate" type="date" stored="true"
            indexed="false"/>
        <!-- fields for creativecommons plugin -->
        <field name="cc" type="string" stored="true" indexed="true"
            multiValued="true"/>
    </fields>
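
As a quick sanity check on the schema above, a minimal sketch like the following could list which fields are explicitly marked indexed="true". It is plain Java using the JDK's built-in XML parser and is not part of Nutch or Solr. The class name IndexedFields and the method indexedFieldNames are made up for illustration, and the sketch only looks at the explicit indexed attribute, ignoring any default inherited from the field type.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Nutch or Solr class.
public class IndexedFields {

    /** Returns, in document order, the names of all <field> elements with indexed="true". */
    static List<String> indexedFieldNames(String schemaXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
        NodeList fields = doc.getElementsByTagName("field");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // Assumption: only an explicit indexed="true" counts; a missing attribute is skipped.
            if ("true".equals(field.getAttribute("indexed"))) {
                names.add(field.getAttribute("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Shortened copy of three fields from the schema above.
        String schema = "<fields>"
                + "<field name=\"id\" type=\"string\" stored=\"true\" indexed=\"true\"/>"
                + "<field name=\"title\" type=\"text\" stored=\"true\" indexed=\"false\"/>"
                + "<field name=\"content\" type=\"text\" stored=\"true\" indexed=\"true\"/>"
                + "</fields>";
        System.out.println(indexedFieldNames(schema)); // prints [id, content]
    }
}
```

Pasting the full <fields> block into the schema string instead of the shortened copy would list every field that is still indexed.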

what do you think?

alessio


On 7 April 2012 at 21:57, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> From the limited HTML that I've seen I can only assume that the offending
> xhtml is in the content field.
>
> If this is the case then you will need to write a custom plugin
> implementation that removes this. There is loads of info allowing you to
> get up to speed with plugins on our wiki.[0]
>
> Once you have something that requires help get on to the list and let us
> know.
>
> Lewis
>
> [0] http://wiki.apache.org/nutch/PluginCentral
>
> On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi <
> alessio.crisant...@gmail.com> wrote:
>
> > maybe it's caused by my schema?
> > I chose to index only title, author and content.
> >
> > can you help me set up a parse filter?
> > thank you
> > alessio
> >
> >
>
