If you can express what you want with a regular expression, then the pattern
filter should work. My guess is that you tokenized the field, which broke up
the structure of the HTML before the pattern was applied.

I would use a "contents" field analyzed with an HTMLStripCharFilterFactory
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory)
and use copyField to copy it to another field, "Title", that has a
KeywordTokenizer in combination with a PatternFilter (with the pattern that
matches the title of your pages).
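
To illustrate why the pattern needs to see the whole, untokenized value, here
is a minimal sketch in plain Java (not Solr code). The class name
TitleExtractor and the <title> regex are my own assumptions, since I do not
know the actual pattern of your pages. It only shows that the regex matches
the raw HTML and finds nothing once the markup has been stripped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr.
class TitleExtractor {
    // Assumed pattern: the text between <title> and </title>.
    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE accepts <TITLE>.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title, or null when the pattern does not match.
    static String extractTitle(String rawHtml) {
        Matcher m = TITLE.matcher(rawHtml);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String raw = "<html><head><title> My Page </title></head>"
                + "<body><p>Hello</p></body></html>";
        // The raw value still contains the markup, so the pattern matches.
        System.out.println(extractTitle(raw));             // My Page
        // Once the tags are gone there is nothing for the regex to anchor on.
        System.out.println(extractTitle("My Page Hello")); // null
    }
}
```

A keyword-style tokenizer keeps the whole value as a single token, which is
why I would pair it with the pattern filter on the title field.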

Thanks
Emmanuel

2011/7/27 Rafael Ribeiro <rafae...@gmail.com>

> Hi all,
>
>  I am trying to index HTML documents using Solr and I am having
> difficulty extracting certain parts of the main content of the document and
> storing them separately in other fields. I saw in the docs that it is
> possible to achieve this using XPath, but in my particular case I need to do
> a regex match.
>  To be more specific, I want to copy content matching a certain pattern to
> the title field. My first attempt was to define a custom field type with a
> PatternFilter and copy the content field to the title field, but this did not
> work. My next attempt was to give the copyField tag pattern and group
> attributes, but this did not work either.
>
>  Is it possible to do what I am trying? I am unwilling to resort to grep
> outside Solr as I am pretty sure Solr is capable of doing what I want...
>
> best regards,
> Rafael Ribeiro
>
