Look at the tokenizer/filter chain that makes up your analyzers, and see: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for other tokenizer/analyzer/filter options. You're on the right track
looking at the various choices provided, and I suspect you'll find what
you need.

Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers.

Best
Erick

On Sat, Aug 28, 2010 at 6:20 PM, Shawn Heisey <elyog...@elyograg.org> wrote:

> It's metadata for a collection of 45 million documents that is mostly
> photos, with some videos and text. The data is imported from a MySQL
> database and split among six large shards (each nearly 13GB) and a small
> shard with data added in the last week. That works out to between 300,000
> and 500,000 documents.
>
> I am mostly trying to think of ways to drastically reduce the index size
> without reducing the functionality. Using copyField would just make it
> larger.
>
> I would like to make it so that I don't have two terms when there's a
> punctuation character at the beginning or end of a word. For instance, one
> field value that I just analyzed ends up with terms like the following,
> which are unneeded duplicates:
>
> championship.
> championship
> '04
> 04
> wisconsin.
> wisconsin
>
> Since I was already toying around, I just tested the whole notion. I ran
> it through once with just generateWordParts and catenateWords enabled, then
> again with all the options including preserveOriginal enabled. A test
> analysis of input with 59 whitespace-separated words showed 93 terms with
> the single filter and 77 with two. The only drop in term quality that I
> noticed was that possessive words (apostrophe-s) no longer have the original
> preserved. I haven't yet decided whether that's a problem.
>
> Shawn
>
> On 8/27/2010 11:00 AM, Erick Erickson wrote:
>
>> I agree with Marcus: the usefulness of passing through WDF twice
>> is suspect.
>> You can always do a copyField to a completely different
>> field and do whatever you want there; copyField forks the raw input
>> to the second field, not the analyzed stream.
>>
>> What is it you're really trying to accomplish? Your use-case would
>> help us help you.
>>
>> About defining things differently in index and analysis. Sure, it can
>> make sense. But, especially with WDF, it's tricky. Spend some
>> significant time in the admin analysis page looking at the effects
>> of various configurations before you decide.
>>
>> Best
>> Erick
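
P.S. To make the duplicate-term problem from Shawn's message concrete, here is a rough plain-Python sketch. It is not Solr code and does not reproduce what WDF actually does; the helper name trim_edge_punctuation and the sample input are made up for illustration. It only shows why keeping the original token next to a punctuation-trimmed copy doubles the term count for words like "championship." and "'04".

```python
import string


def trim_edge_punctuation(tokens):
    """Strip punctuation from both ends of each token; drop empty results.

    Hypothetical helper for illustration only, not a Solr or Lucene API.
    Punctuation inside a token (e.g. the apostrophe in "don't") is kept.
    """
    trimmed = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in trimmed if tok]


# Sample input modeled on the terms quoted in the thread.
text = "championship. '04 wisconsin."
raw = text.split()  # whitespace tokenization

# Original tokens kept alongside the trimmed forms: two terms per word
# whenever a word starts or ends with punctuation.
with_original = sorted(set(raw) | set(trim_edge_punctuation(raw)))

# Trimmed forms only: one term per word.
trimmed_only = sorted(set(trim_edge_punctuation(raw)))

print(len(with_original), with_original)
print(len(trimmed_only), trimmed_only)
```

The admin analysis page remains the place to check what a real analyzer chain emits; the sketch only illustrates the counting argument.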