: Another question I have is where the processing of this "first letter" is
: best done.
: I am considering updating my data import handler to execute a script to
: extract the first letter from the author field.
: 
: I saw another thread where someone mentioned using a field analyser to
: extract the letter using a regex.
: Which one is the best option?

"best" is subjective.

conceptually, "inherent" rules/concepts of your data (ie: what fields it
has, what types those fields have, etc...) should live in your schema.xml,
while things specific to where your data comes from should live in other
configs (ie: your DIH config, update processors, etc...)

so for something like a "first_letter_author_name" field that should (by
definition) always be the same as the first letter of the "author_name"
field, it should be specified in your schema.xml (two ways i can think of:
copyField w/maxChars, or an EdgeNGramTokenizer) .. that way no matter how
a document gets in your index (DIH, XML Push, CSV Push, etc...) you can be
certain the fields will be internally consistent.
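
a rough sketch of what those two options might look like in schema.xml
(untested; the "first_letter" type name, the use of the stock "string"
type, and the lowercase filter are my assumptions for illustration, so
adjust them to your own schema, and pick one option, not both):

```xml
<!-- option 1: copyField truncated to a single character.
     copies the raw value, so case and any leading whitespace are kept -->
<field name="first_letter_author_name" type="string"
       indexed="true" stored="true"/>
<copyField source="author_name" dest="first_letter_author_name"
           maxChars="1"/>

<!-- option 2: copy the whole value and let the index analyzer keep
     only the leading 1-character gram -->
<fieldType name="first_letter" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.EdgeNGramTokenizerFactory"
               minGramSize="1" maxGramSize="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- the query is already a single letter, so don't gram it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="first_letter_author_name" type="first_letter"
       indexed="true" stored="false"/>
<copyField source="author_name" dest="first_letter_author_name"/>
```

with option 2 only the indexed term is the first letter (a stored value
would still be the full author name), so it suits searching and faceting
more than displaying the letter.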

Practically speaking: there are a lot of "inherent" rules that can't be
expressed in the schema.xml, or may be confusing to people if they are
expressed there while other more complex rules are expressed elsewhere --
so go with whatever makes the most sense to you, and is the easiest for
you to maintain.


-Hoss
