Hi Ferran,

On 10/11/2012 12:40 PM, Ferran Jorba wrote:
Hi Ludmila,

[...]
I'm not sure, what would happen if you add the missing '(' and ')' to
CFG_BIBINDEX_CHARS_PUNCTUATION and reindex your site?  Could you just
use expressions like "StatID DE-HGF 0100" and "StatID DE-HGF 0110",
removing all problematic characters?  (Yes, I know your site is large,
maybe you have a smaller installation to test this).
These two config variables are useful for breaking phrases into words
(thus having an impact on the indexing of words - or better said, what
is considered a 'word'), but they won't help too much if you need to
do exact phrase search (which uses the phrase index).  The
CFG_BIBINDEX_CHARS_PUNCTUATION is used to split the phrase into blocks
(which also get indexed as words) and then the blocks are further
split into words using CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS.
All this work is done for computing the words, but phrases are left as
they are (without any processing).  We can check in ipython what would
be the difference if '(' and ')' get added.
Thanks for the detailed explanation; I could hardly have arrived at this
conclusion on my own.  Once I tried to add a few local variations of
quotation marks that, you know, are highly cultural (see, for example,
http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage), but I
wasn't able to come up with a good behaviour, although I don't remember
the details now; it was quite a while ago.  Where should I add the '«',
'»', '„', '“'?  Just to CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS,
right?

If we take an example: foo«bar-baz»

If you add them to the CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, the words 
indexed will be: ['foo«bar-baz»', 'foo', 'bar', 'baz']
If you add them to the CFG_BIBINDEX_CHARS_PUNCTUATION (or to both), the words 
indexed will be as above, plus ['bar-baz']

So I guess it depends on what you would like to index. If you would also like to
index what is inside the quotation marks as a single word (for example, in the
case of report numbers), then they should be added to
CFG_BIBINDEX_CHARS_PUNCTUATION.

Of course, all of the above applies in cases where we are trying to split a word
(a sequence of characters without spaces) into blocks.
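
The splitting described above can be sketched in Python as follows. This is
only an illustration of the two-stage idea (punctuation first, then
alphanumeric separators), not Invenio's actual tokenizer: the helper names
char_class and words_for are made up, the two character sets are short
stand-ins rather than the real default values, and lowercasing, accent
stripping, stemming and stopword handling are left out.

```python
import re


def char_class(chars):
    """Build a regex matching any single character from `chars` (non-empty)."""
    return re.compile("[" + "".join(re.escape(c) for c in chars) + "]")


def words_for(token, punctuation, separators):
    """Return the words derived from one whitespace-free token.

    Hypothetical helper: the token is kept as is, then split into blocks
    on `punctuation`, and each block is split again on `separators`.
    """
    words = [token]  # the whole token is always kept
    for block in char_class(punctuation).split(token):
        if not block:
            continue  # skip empty pieces from leading/trailing punctuation
        if block not in words:
            words.append(block)
        for word in char_class(separators).split(block):
            if word and word not in words:
                words.append(word)
    return words


# Stand-in character sets, assumed for illustration only.
PUNCTUATION = '.,:;?!"'
SEPARATORS = "-_/()"

token = "foo«bar-baz»"

# Quotation marks added to the separators only.
as_separators = words_for(token, PUNCTUATION, SEPARATORS + "«»")

# Quotation marks added to the punctuation only.
as_punctuation = words_for(token, PUNCTUATION + "«»", SEPARATORS)

# Quotation marks added to both.
as_both = words_for(token, PUNCTUATION + "«»", SEPARATORS + "«»")

print(as_separators)
print(as_punctuation)
print(as_both)
```

With these stand-in sets, the first call gives the four words listed above,
and the other two calls give the same words plus 'bar-baz'.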


Cheers,
Ludmila

--
Ludmila Marian ** CERN Document Server ** <http://cds.cern.ch/>
