Hi Ferran,
On 10/11/2012 12:40 PM, Ferran Jorba wrote:
> Hi Ludmila,
> [...]
>>> I'm not sure: what would happen if you added the missing '(' and ')' to CFG_BIBINDEX_CHARS_PUNCTUATION and reindexed your site? Could you just use expressions like "StatID DE-HGF 0100" and "StatID DE-HGF 0110", removing all problematic characters? (Yes, I know your site is large; maybe you have a smaller installation to test this.)
>>
>> These two config variables are useful for breaking phrases into words (thus having an impact on the indexing of words, or better said, on what is considered a 'word'), but they won't help much if you need to do exact phrase search (which uses the phrase index). CFG_BIBINDEX_CHARS_PUNCTUATION is used to split the phrase into blocks (which also get indexed as words), and then the blocks are further split into words using CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS. All this work is done for computing the words, but phrases are left as they are (without any processing). We can check in ipython what the difference would be if '(' and ')' get added.
>
> Thanks for the detailed explanation; I could hardly have reached this conclusion on my own. I once tried to add a few local variations of quotation marks that, you know, are highly cultural (see, for example, http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage), but I wasn't able to come up with a good behaviour, although I don't remember the details now; it was quite a while ago. Where should I add the '«', '»', '„', '“'? Just to CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, right?

If we take an example: foo«bar-baz»

If you add them to CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, the words indexed will be:
['foo«bar-baz»', 'foo', 'bar', 'baz']

If you add them to CFG_BIBINDEX_CHARS_PUNCTUATION (or to both), the words indexed will be as above, plus:
['bar-baz']

So I guess it depends on what you would like to index. If you would also like to index what is inside the quotation marks as a single word (for report numbers, say), then they should be added to CFG_BIBINDEX_CHARS_PUNCTUATION.

Of course, all of the above applies in cases where we are trying to split a word (a sequence of characters without spaces) into blocks.

Cheers,
Ludmila

--
Ludmila Marian ** CERN Document Server ** <http://cds.cern.ch/>
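
P.S. For anyone who wants to try the example in ipython without touching a real installation, here is a rough sketch of the two-stage split described above. It is not the actual bibindex code: the function name index_words and the two small character sets are hypothetical, chosen only for illustration, and the real indexer also does things like lowercasing, stemming and stopword handling that are left out here.

```python
import re

def index_words(word, punctuation, separators):
    """Return the terms one whitespace-free word would contribute.

    Simplified model: keep the word itself, split it into blocks on the
    punctuation characters, then split each block on the separators.
    """
    punct_re = re.compile("[" + re.escape(punctuation) + "]")
    sep_re = re.compile("[" + re.escape(separators) + "]")
    terms = [word]
    for block in punct_re.split(word):
        if not block:
            continue  # skip empty strings produced by leading/trailing matches
        if block not in terms:
            terms.append(block)
        for piece in sep_re.split(block):
            if piece and piece not in terms:
                terms.append(piece)
    return terms

# Hypothetical, deliberately small character sets (not the shipped defaults).
BASE_PUNCTUATION = ".,:;?!"
BASE_SEPARATORS = "-"
QUOTES = "«»„“"

# Quotation marks added to the separators only.
sep_only = index_words("foo«bar-baz»", BASE_PUNCTUATION, BASE_SEPARATORS + QUOTES)

# Quotation marks added to the punctuation (and to the separators as well).
as_punct = index_words("foo«bar-baz»", BASE_PUNCTUATION + QUOTES, BASE_SEPARATORS + QUOTES)

print(sep_only)
print(as_punct)
```

Under these assumptions the first call gives the four terms listed above, and the second gives the same terms plus 'bar-baz' (the order in the list may differ).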
