Thank you again, Lewis, but do you think the strange content field is my own fault? I ask because I disabled indexing for almost all fields.
This is my schema:

<fields>
  <field name="id" type="string" stored="true" indexed="true"/>

  <!-- core fields -->
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>

  <!-- fields for index-basic plugin -->
  <field name="host" type="url" stored="false" indexed="false"/>
  <field name="site" type="string" stored="true" indexed="false"/>
  <field name="url" type="url" stored="true" indexed="false" required="true"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="title" type="text" stored="true" indexed="false"/>
  <field name="cache" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>

  <!-- fields for index-anchor plugin -->
  <field name="anchor" type="string" stored="true" indexed="false" multiValued="true"/>

  <!-- fields for index-more plugin -->
  <field name="type" type="string" stored="true" indexed="false" multiValued="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="lastModified" type="date" stored="false" indexed="false"/>
  <field name="date" type="date" stored="true" indexed="false"/>

  <!-- fields for languageidentifier plugin -->
  <field name="lang" type="string" stored="true" indexed="false"/>

  <!-- fields for subcollection plugin -->
  <field name="subcollection" type="string" stored="true" indexed="false" multiValued="true"/>

  <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
  <field name="author" type="string" stored="true" indexed="true"/>
  <field name="tag" type="string" stored="true" indexed="true" multiValued="false"/>
  <field name="feed" type="string" stored="true" indexed="false"/>
  <field name="publishedDate" type="date" stored="true" indexed="false"/>
  <field name="updatedDate" type="date" stored="true" indexed="false"/>

  <!-- fields for creativecommons plugin -->
  <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
</fields>

What do you think?

Alessio

On 7 April 2012 at 21:57, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> From the limited HTML that I've seen I can only assume that the offending
> xhtml is in the content field.
>
> If this is the case then you will need to write a custom plugin
> implementation that removes this. There is loads of info allowing you to
> get up to speed with plugins on our wiki.[0]
>
> Once you have something that requires help get on to the list and let us
> know.
>
> Lewis
>
> [0] http://wiki.apache.org/nutch/PluginCentral
>
> On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi <
> alessio.crisant...@gmail.com> wrote:
>
> > may be it'd my cause with my schema?
> > I chose for inex about only title, author and content.
> >
> > can you help me for setting a parsefilter?
> > thank you
> > alessio