It's metadata for a collection of 45 million documents that is mostly
photos, with some videos and text. The data is imported from a MySQL
database and split among six large shards (each nearly 13GB) and a small
shard holding data added in the last week, which works out to between
300,000 and 500,000 documents.
I am mostly trying to think of ways to drastically reduce the index size
without reducing the functionality. Using copyField would just make it
larger.
I would like to make it so that I don't have two terms when there's a
punctuation character at the beginning or end of a word. For instance,
one field value that I just analyzed ends up with terms like the
following, which are unneeded duplicates:
championship.
championship
'04
04
wisconsin.
wisconsin
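One possible way to avoid such duplicates, offered only as an untested
sketch, would be to strip leading and trailing punctuation before the
word delimiter filter sees the token, so the preserved original and the
generated word part become the same term. The fragment below assumes
solr.PatternReplaceFilterFactory is available in this Solr version and
that a whitespace tokenizer is in use; the field type name is made up.

```xml
<!-- Hypothetical field type; the name and tokenizer are assumptions. -->
<fieldType name="text_stripped" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Remove punctuation at the start or end of each token,
         e.g. "championship." becomes "championship" and "'04" becomes "04". -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\p{Punct}+|\p{Punct}+$"
            replacement="" replace="all"/>
    <!-- The word delimiter filter now only sees the trimmed token. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Note that this would also strip a trailing apostrophe from plural
possessives, so the result should be checked on the admin analysis page.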
Since I was already toying around, I just tested the whole notion. I
ran it through once with just generateWordParts and catenateWords
enabled, then again with all the options including preserveOriginal
enabled. A test analysis of input with 59 whitespace-separated words
showed 93 terms with the single filter and 77 with two. The only drop
in term quality that I noticed was that possessive words (apostrophe-s)
no longer have the original preserved. I haven't yet decided whether
that's a problem.
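As a rough sketch, the two-pass chain described above might look
something like the following in schema.xml. The field type name and the
whitespace tokenizer are assumptions, and setting every unmentioned
option to 0 in the first pass is just one reading of "just
generateWordParts and catenateWords enabled".

```xml
<!-- Hypothetical field type; the name and tokenizer are assumptions. -->
<fieldType name="text_wdf2" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- First pass: only word parts and catenated words. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"
            generateNumberParts="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="0"/>
    <!-- Second pass: all options, including preserveOriginal. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Term counts will depend on the actual input, so the numbers above
should not be expected from this fragment without testing.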
Shawn
On 8/27/2010 11:00 AM, Erick Erickson wrote:
I agree with Marcus, the usefulness of passing through WDF twice
is suspect. You can always do a copyField to a completely different
field and do whatever you want there; copyField forks the raw input
to the second field, not the analyzed stream...
What is it you're really trying to accomplish? Your use-case would
help us help you.
About defining things differently in index and analysis. Sure, it can
make sense. But, especially with WDF it's tricky. Spend some
significant time in the admin analysis page looking at the effects
of various configurations before you decide.
Best
Erick