I want to add that, since the stored text (unlike the indexed text) is not
analyzed, retrieving the title will return all of the HTML. If you want to
extract the title for storage in a separate field, that will have to be done
with a different tool, not with analysis alone. My previous answer focused
only on extracting text for searching purposes.
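
As a rough sketch of what that "different tool" could look like, here is a
small pre-indexing step in Python that pulls the title out with a regular
expression before the document is sent to Solr. The pattern (a <title>
element), the field names and the helper functions are all assumptions made
for illustration; adjust the pattern to wherever the title actually sits in
your pages.

```python
import re

# Assumed pattern: the title is the text of the first <title> element.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the first <title> text with whitespace collapsed, or None."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return " ".join(match.group(1).split())


def build_document(doc_id, html):
    """Build the fields to send to Solr (field names are hypothetical)."""
    return {
        "id": doc_id,
        "content": html,               # raw HTML, stripped by analysis at index time
        "title": extract_title(html),  # extracted up front, so the stored value is clean
    }


page = (
    "<html><head><title>\n  Filter content upon indexing </title></head>"
    "<body><p>Hello</p></body></html>"
)
doc = build_document("page-1", page)
print(doc["title"])  # prints "Filter content upon indexing"
```

Because the title is extracted before indexing, the value stored in the title
field is the clean text rather than the whole HTML page.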

Thanks
Emmanuel

2011/7/27 Emmanuel Espina <espinaemman...@gmail.com>

> If you can express what you want with a regular expression, then the pattern
> filter should work! I suspect that you tokenized the field and that this
> destroyed the structure of the HTML.
>
> I would use a "contents" field analyzed with a
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> and use copyField to copy it to another field, Title, that has a
> KeywordTokenizer in combination with PatternFilter (using the pattern of the
> title of your pages).
>
> Thanks
> Emmanuel
>
>
> 2011/7/27 Rafael Ribeiro <rafae...@gmail.com>
>
>> Hi all,
>>
>>  I am trying to index HTML documents using Solr, and I am having difficulty
>> extracting certain parts of the main content of the document and storing
>> them separately in other fields. I saw in the docs that it is possible to
>> achieve this using XPath, but in my particular case I need to do a regex
>> match.
>>  To be more specific, I want to copy content matching a certain pattern to
>> the title field. My first attempt was to define a custom field type with a
>> PatternFilter and copy the content field to the title field, but this did
>> not work. My next attempt was to give the copyField tag pattern and group
>> attributes, but this did not work either.
>>
>>  Is it possible to do what I am trying? I am unwilling to resort to grep
>> outside Solr as I am pretty sure Solr is capable of doing what I want...
>>
>> best regards,
>> Rafael Ribeiro
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Filter-content-upon-indexing-tp3203946p3203946.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
