Two ways I can think of:

   - ExtractingRequestHandler (I am guessing this is what you are using now)

Set extractOnly=true when making the request to the ExtractingRequestHandler
to get the parsed content back without indexing it. Then make a POST request
to the update request handler with whatever fields and field values you want.


   - Use HTMLStripWhitespaceTokenizerFactory. This article may help
   explain what I mean:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory
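
To make the first option more concrete, here is a rough Python sketch of the two requests. It only builds the URL and the XML body and sends nothing. The base URL, the /update/extract path (the mapping used in the example solrconfig.xml) and the "text" field name are assumptions, and the helper function names are made up for illustration, so adjust them to your setup.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr

# Assumed base URL of the Solr instance; change to match your install.
SOLR = "http://localhost:8983/solr"


def extract_only_url(base=SOLR):
    """Step 1: URL that asks the extracting handler to parse but not index."""
    # /update/extract is assumed to be where ExtractingRequestHandler is mapped.
    return base + "/update/extract?" + urlencode({"extractOnly": "true"})


def add_document_xml(fields):
    """Step 2: XML <add> message containing only the fields you choose."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # Escape names and values so HTML-ish content cannot break the XML.
        parts.append(
            "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        )
    parts.append("</doc></add>")
    return "".join(parts)


if __name__ == "__main__":
    print(extract_only_url())
    # "text" is a hypothetical field holding the content returned by step 1.
    print(add_document_xml({"id": 10, "text": "parsed content goes here"}))
```

You would POST the HTML file to the first URL, pull the parsed content out of the response, then POST the XML body to the update handler (typically /update) and commit. Since you build the document yourself, the id from the Tika metadata never reaches the index.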



- Raghu



On Sat, Dec 5, 2009 at 3:44 AM, khalid y <kern...@gmail.com> wrote:

> Hi,
>
> I have a problem with Solr. I'm indexing some HTML content and Solr crashes
> because my id field is multivalued.
> I found that Tika reads the HTML and extracts metadata like <meta name="id"
> content="12"> from my HTML files, but my documents already have an id set by
> literal.id=10.
>
> I tried to map the id from Tika with fmap.id=ignored_ but it also ignores my
> literal.id.
>
> I'm using Solr 1.4 and Tika 0.5.
>
> Can someone explain how I can ignore the Tika id metadata?
>
> Thanks
>
