Hi Ferran,
On 10/11/2012 09:22 AM, Ferran Jorba wrote:
Hello Alexander and Ludmila,
On 01.10.2012 18:37, Ludmila Marian wrote:
Hello Ludmila!
915__:'StatID:(DE-HGF)0100' OR 915__:'StatID:(DE-HGF)0110' AND
9201_:'I:(DE-Juel1)ZB-20090406'
Actually, if these are exact patterns (the full content of the tags, not
substrings), replacing the single quotes with double quotes should give
faster search times:
915__:"StatID:(DE-HGF)0100" OR 915__:"StatID:(DE-HGF)0110" AND
9201_:"I:(DE-Juel1)ZB-20090406"
I fear I'd have to ask you for a fix concerning the braces in search
terms if they are contained in the "literal string" tags. (An
Maybe I'll take advantage of this thread to understand something that
has never been clear to me: what is the behaviour of the ':', '(' and
')' characters as separators? I mean, there are those two Invenio config
variables (CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and
CFG_BIBINDEX_CHARS_PUNCTUATION) that maybe you can use to your
advantage:
http://invenio-software.org/repo/invenio/tree/config/invenio.conf#n929
I'm not sure what would happen if you added the missing '(' and ')' to
CFG_BIBINDEX_CHARS_PUNCTUATION and reindexed your site. Could you just
use expressions like "StatID DE-HGF 0100" and "StatID DE-HGF 0110",
removing all the problematic characters? (Yes, I know your site is
large; maybe you have a smaller installation to test this on.)
These two config variables are useful for breaking phrases into words
(thus they have an impact on the indexing of words, or rather on what is
considered a 'word'), but they won't help much if you need to do an
exact phrase search (which uses the phrase index).
CFG_BIBINDEX_CHARS_PUNCTUATION is used to split the phrase into blocks
(which also get indexed as words), and the blocks are then further split
into words using CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS.
All this work is done for computing the words, but phrases are left as
they are (without any processing).
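Just to illustrate the mechanism (this is not the actual BibIndex code,
only a simplified sketch; the character sets below are made up, the real
ones come from the two config variables and the real tokenizer also does
more normalisation):

import re

PUNCTUATION = r'[.,:;?!"]'   # pretend punctuation set: phrase -> blocks
SEPARATORS  = r'[\s/-]'      # pretend separator set: blocks -> words

def toy_words(phrase):
    """Simplified two-level splitting: blocks are indexed as words too."""
    tokens = set([phrase.lower()])      # the full phrase is kept as well
    for block in re.split(PUNCTUATION, phrase.lower()):
        if block:
            tokens.add(block)           # each block becomes a word
            for word in re.split(SEPARATORS, block):
                if word:
                    tokens.add(word)    # and so does each sub-word
    return sorted(tokens)

# rough idea only; the real tokenizer also strips stray punctuation
# from the ends of the tokens
print(toy_words("StatID:(DE-HGF)0100"))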
We can check in ipython what the difference would be if '(' and ')' were added.
Without them:
In [3]: p = "StatID:(DE-HGF)0100" # our phrase in this example
In [4]: words = BibIndexWordTokenizer()
In [5]: words.tokenize(p)
Out[5]: ['(de-hgf)0100', 'statid:(de-hgf)0100', 'de', '0100', 'statid', 'hgf']
#what the system considers as words
In [7]: pairs = BibIndexPairTokenizer()
In [8]: pairs.tokenize(p)
Out[8]: ['hgf 0100', 'de hgf', 'statid de'] #pairs
In [9]: phrase = BibIndexPhraseTokenizer()
In [10]: phrase.tokenize(p)
Out[10]: ['StatID:(DE-HGF)0100'] # phrase - the whole thing
With '(' and ')' added:
In [4]: words.tokenize(p)
Out[4]: ['de-hgf', 'statid:(de-hgf)0100', 'de', '0100', 'statid', 'hgf']
In [6]: pairs.tokenize(p)
Out[6]: ['hgf 0100', 'de hgf', 'statid de']
In [11]: phrase.tokenize(p)
Out[11]: ['StatID:(DE-HGF)0100']
So by adding '(' and ')' we are instructing the indexer to treat them as
spaces, i.e. to break the phrase wherever it finds them.
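To make the effect concrete with plain re (again only an illustration,
not the real tokenizer): once '(' and ')' act as split points at the
block level, 'de-hgf' comes out as a block of its own, and since blocks
are also indexed as words, this is why it shows up in the word output
above.

import re

phrase = "statid:(de-hgf)0100"

blocks_before = [b for b in re.split(r'[:]', phrase) if b]
blocks_after  = [b for b in re.split(r'[:()]', phrase) if b]

print(blocks_before)   # ['statid', '(de-hgf)0100']
print(blocks_after)    # ['statid', 'de-hgf', '0100']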
Cheers,
Ludmila
--
Ludmila Marian ** CERN Document Server ** <http://cds.cern.ch/>