If you can express what you want with a regular expression, then the pattern filter should work. My guess is that you tokenized the field, and that broke up the structure of the HTML that your pattern needs to match.
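To illustrate what I mean about tokenizing, here is a rough Python sketch (not Solr code). It assumes the title sits in an `<h1>` element, which is only a guess about your pages, and it uses a plain whitespace split as a stand-in for a tokenizer. The names `TITLE_PATTERN`, `extract_from_whole_value` and `extract_from_tokens` are made up for the example.

```python
import re

# Hypothetical pattern: assumes the title lives in an <h1> element.
TITLE_PATTERN = re.compile(r"<h1[^>]*>(.*?)</h1>", re.DOTALL)


def extract_from_whole_value(html):
    """Apply the pattern to the entire field value as one unit,
    which is roughly what a pattern sees after a keyword-style tokenizer."""
    match = TITLE_PATTERN.search(html)
    return match.group(1) if match else None


def extract_from_tokens(html):
    """Apply the pattern to each whitespace-separated token on its own,
    a crude stand-in for a filter running after a splitting tokenizer."""
    results = []
    for token in html.split():
        match = TITLE_PATTERN.search(token)
        if match:
            results.append(match.group(1))
    return results


doc = (
    '<html><body><h1 class="t">Filter content upon indexing</h1>'
    "<p>Body text</p></body></html>"
)

whole = extract_from_whole_value(doc)   # the pattern sees both tags at once
per_token = extract_from_tokens(doc)    # no single token holds both tags
```

On the whole value the pattern can see the opening and closing tag together; once the value is split, no single token contains both, so nothing matches.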
I would use a "contents" field analyzed with an HTMLStripCharFilterFactory (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory) and use copyField to copy it to another field, Title, that has a KeywordTokenizer in combination with PatternFilter (with the pattern of the title of your pages).

Thanks,
Emmanuel

2011/7/27 Rafael Ribeiro <rafae...@gmail.com>

> Hi all,
>
> I am trying to index html documents using Solr and I am having difficulties
> extracting certain parts of the main content of the document and storing them
> separately in other fields. I saw in the docs that it is possible to
> achieve this using xpath, but in my particular case I need to do a regex match.
> To be more specific, I want to copy content matching a certain pattern to the
> title field. My first attempt was to define a custom field type with a
> PatternFilter and copy the content field to the title field, but this did not work.
> My next attempt was to give the copyField tag pattern and
> group attributes, but this did not work either.
>
> Is it possible to do what I am trying? I am unwilling to resort to grep
> outside Solr as I am pretty sure Solr is capable of doing what I want...
>
> best regards,
> Rafael Ribeiro
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Filter-content-upon-indexing-tp3203946p3203946.html
> Sent from the Solr - User mailing list archive at Nabble.com.