Look at the tokenizer/filter chain that makes up your analyzers, and see:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for other tokenizer/analyzer/filter options.

You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...

Be a little cautious about preserving the original tokens. Your users will
often be more confused than helped if you require hyphens for a match. Ditto
with possessives, plurals, etc. You might want to look at stemmers....
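
To make the duplicate-term issue from your list concrete, here is a minimal
Python sketch. It is not Solr code and not how WordDelimiterFilter works
internally; the function names are hypothetical and it only illustrates the
idea of stripping punctuation from the edges of a term so that
"championship." and "championship" collapse into one term.

```python
import string


def strip_edge_punctuation(term: str) -> str:
    """Remove ASCII punctuation from both ends of a term; interior characters stay."""
    return term.strip(string.punctuation)


def unique_terms(terms):
    """Return the distinct terms after edge stripping, in first-seen order."""
    seen = set()
    result = []
    for term in terms:
        stripped = strip_edge_punctuation(term)
        # Skip terms that were nothing but punctuation, and skip repeats.
        if stripped and stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result


if __name__ == "__main__":
    sample = ["championship.", "championship", "'04", "04", "wisconsin.", "wisconsin"]
    print(unique_terms(sample))  # ['championship', '04', 'wisconsin']
```

Inside Solr, a pattern-based filter such as PatternReplaceFilterFactory may
give a comparable effect, but that is an assumption to verify on the admin
analysis page against your own data.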

Best
Erick

On Sat, Aug 28, 2010 at 6:20 PM, Shawn Heisey <elyog...@elyograg.org> wrote:

>  It's metadata for a collection of 45 million documents that is mostly
> photos, with some videos and text.  The data is imported from a MySQL
> database and split among six large shards (each nearly 13GB) and a small
> shard with data added in the last week.  The small shard holds between
> 300,000 and 500,000 documents.
>
> I am mostly trying to think of ways to drastically reduce the index size
> without reducing the functionality.  Using copyField would just make it
> larger.
>
> I would like to make it so that I don't have two terms when there's a
> punctuation character at the beginning or end of a word.  For instance, one
> field value that I just analyzed ends up with terms like the following,
> which are unneeded duplicates:
>
>
> championship.
> championship
> '04
> 04
> wisconsin.
> wisconsin
>
> Since I was already toying around, I just tested the whole notion.  I ran
> it through once with just generateWordParts and catenateWords enabled, then
> again with all the options including preserveOriginal enabled.  A test
> analysis of input with 59 whitespace separated words showed 93 terms with
> the single filter and 77 with two.  The only drop in term quality that I
> noticed was that possessive words (apostrophe-s) no longer have the original
> preserved.  I haven't yet decided whether that's a problem.
>
>
> Shawn
>
>
> On 8/27/2010 11:00 AM, Erick Erickson wrote:
>
>> I agree with Marcus, the usefulness of passing through WDF twice
>> is suspect. You can always do a copyfield to a completely different
>> field and do whatever you want there, copyfield forks the raw input
>> to the second field, not the analyzed stream...
>>
>> What is it you're really trying to accomplish? Your use-case would
>> help us help you.
>>
>> About defining things differently at index and query time. Sure, it can
>> make sense. But, especially with WDF it's tricky. Spend some
>> significant time in the admin analysis page looking at the effects
>> of various configurations before you decide.
>>
>> Best
>> Erick
>>
>
>
