Re: RFC unifying phrase search behaviour

2014-02-25 Thread Ferran Jorba
Hi,

Alexander Wagner a.wag...@fz-juelich.de wrote:
 
 On 24.02.2014 11:30, Tibor Simko wrote:

 Hi!

 People don't easily distinguish between the following queries:

 title:'some phrase'

 substring

 title:"some phrase"

 exact search

 [...]

Once more, I agree with Alexander.  The whole reply.

Danke,

Ferran


Re: RFC unifying phrase search behaviour

2014-02-25 Thread Tibor Simko
On Mon, 24 Feb 2014, Alexander Wagner wrote:
 245:'some phrase'
 245:"some phrase"

 so that single-quoted and double-quoted phrase queries would always
 return the same result.

 Which is then an exact match, right? So to get '' matches one would
 use *bla*, right?

No, actually, not an exact match, but a word pair match.  Here is a
possibly clearer example.  Consider the following record:

   245 $a The Kreutzer Sonata

When users type:

   245:'Kreutzer Sonata'
   245:"Kreutzer Sonata"

then the record would be returned.

When users type:

   245:'reutzer son'
   245:"reutzer son"

then the record won't be returned; people would have to type:

   245:/reutzer son/

in order to get a substring match.

In summary:

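  +-----------------------+-------------------+--------------------+
  | QUERY                 | CURRENT BEHAVIOUR | PROPOSED BEHAVIOUR |
  +-----------------------+-------------------+--------------------+
  | 245:'Kreutzer Sonata' | hit               | hit                |
  | 245:"Kreutzer Sonata" | miss              | hit                |
  | 245:'reutzer son'     | hit               | miss               |
  | 245:"reutzer son"     | miss              | miss               |
  | 245:/reutzer son/     | hit               | hit                |
  +-----------------------+-------------------+--------------------+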

Note that the proposed behaviour is already in place for some logical
indexes such as title in the Invenio v1.1 release series and above.  The
current RFC proposes to widen its scope to cover all indexes, including
physical MARC queries.
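
The hit/miss table above can be reproduced with a small sketch. It is only an illustration: the tokeniser (lowercase, split on non-word characters), the case-insensitive comparisons, and all function names are assumptions made for this example, not Invenio's actual code.

```python
import re

FIELD = "The Kreutzer Sonata"  # the 245 $a value of the example record


def words(text):
    # Assumption: lowercase and split on non-word characters.
    return re.findall(r"\w+", text.lower())


def word_pairs(text):
    # Set of adjacent word pairs, e.g. {("the", "kreutzer"), ("kreutzer", "sonata")}.
    w = words(text)
    return set(zip(w, w[1:]))


def substring_hit(field, phrase):
    return phrase.lower() in field.lower()


def exact_hit(field, phrase):
    return phrase.lower() == field.lower()


def word_pair_hit(field, phrase):
    query = words(phrase)
    if len(query) < 2:
        # A single word has no pairs; fall back to a plain word match.
        return set(query) <= set(words(field))
    # Every adjacent pair of the query must occur in the field.
    return word_pairs(phrase) <= word_pairs(field)


def regex_hit(field, pattern):
    return re.search(pattern, field, flags=re.IGNORECASE) is not None


# Which matcher each delimiter selects, before and after the proposed change.
CURRENT = {"'": substring_hit, '"': exact_hit, "/": regex_hit}
PROPOSED = {"'": word_pair_hit, '"': word_pair_hit, "/": regex_hit}


def label(ok):
    return "hit" if ok else "miss"


def behaviour(delimiter, phrase, field=FIELD):
    """Return (current, proposed) as 'hit' or 'miss' for one query."""
    return (label(CURRENT[delimiter](field, phrase)),
            label(PROPOSED[delimiter](field, phrase)))


for delimiter, phrase in [("'", "Kreutzer Sonata"), ('"', "Kreutzer Sonata"),
                          ("'", "reutzer son"), ('"', "reutzer son"),
                          ("/", "reutzer son")]:
    print(delimiter + phrase + delimiter, *behaviour(delimiter, phrase))
```

The point of the sketch is that word pair matching compares whole words, so a fragment such as "reutzer son" can only be found through the regular expression form.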

 If I get it correctly, it breaks almost all our bean counting. IDs are
 something like

   sid:(DE-HGF)1

 or

   sid:(DE-HGF)11

 if you map sid:"(DE-HGF)1" to the old 'sid:(DE-HGF)1' it matches also
 sid:"(DE-HGF)11", which is wrong and not intended.

Nope, it would not be mapped that way, see above.  The ID matching would
remain safe.
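
To illustrate why whole-word matching keeps such IDs apart while substring matching does not, here is a minimal sketch. It assumes identifiers are tokenised on whitespace only, so that "(DE-HGF)1" stays a single token; the real tokenisation of the sid index may differ, and the function names are made up for this example.

```python
def tokens(value):
    # Assumption: split on whitespace only, so "(DE-HGF)1" is one token.
    return value.lower().split()


def substring_match(field, query):
    # Old single-quote semantics: the query may occur anywhere in the value.
    return query.lower() in field.lower()


def token_match(field, query):
    # Whole-token semantics: the query tokens must occur as a contiguous run.
    f, q = tokens(field), tokens(query)
    n = len(q)
    return n > 0 and any(f[i:i + n] == q for i in range(len(f) - n + 1))


records = ["(DE-HGF)1", "(DE-HGF)11"]

# Substring matching returns both records for the shorter ID.
print([r for r in records if substring_match(r, "(DE-HGF)1")])

# Whole-token matching returns only the record with exactly that ID.
print([r for r in records if token_match(r, "(DE-HGF)1")])
```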

Best regards
--
Tibor Simko


RFC unifying phrase search behaviour

2014-02-24 Thread Tibor Simko
Hi:

For the forthcoming release of Invenio v1.2, we'd like to change one
long-standing feature related to phrase queries.

People don't easily distinguish between the following queries:

   title:'some phrase'
   title:"some phrase"

which is why in 2012 we introduced a configuration option that makes it
possible to specify, for each index, whether the difference between
single-quoted and double-quoted expressions should be respected.
By default, we removed the difference in the most exposed indexes
such as global, title, abstract, but we kept it for MARC queries in
order not to break existing cataloguing workflows.

We'd now like to extend this to all indexes by default, including MARC
queries like:

   245:'some phrase'
   245:"some phrase"

so that single-quoted and double-quoted phrase queries would always
return the same result.

What this change means for you:

1. The end users can use single-quoted or double-quoted queries to
   express phrase search, in all indexes.  There would be no difference.

2. The phrase search would be done by default via word pair matching,
   unless indexes are tokenised in a special manner (e.g. exact author
   name) or unless users search inside physical MARC tags (when no word
   pair index exists).

3. If you have relied on partial phrase matching, please switch to
   regular expression queries like:

  245:/some phrase/
  245:/[[:blank:]]some phrase[[:blank:]]/

4. If you have relied on exact phrase matching, please switch to
   regular expression queries like:

  245:/^Exact title.$/
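
The regular expression forms in points 3 and 4 can be tried out with Python's re module, as a rough sketch. Two assumptions apply: Python's re does not support POSIX classes such as [[:blank:]], so [ \t] stands in for it here, and matching is made case-insensitive. The helper name regex_hit is hypothetical.

```python
import re


def regex_hit(pattern, value):
    # Assumption: case-insensitive search anywhere in the value.
    return re.search(pattern, value, flags=re.IGNORECASE) is not None


# Point 3: partial phrase matching.
print(regex_hit(r"some phrase", "Here is some phrase indeed"))    # plain substring
print(regex_hit(r"some phrase", "awesome phrases"))               # also matches inside words
print(regex_hit(r"[ \t]some phrase[ \t]", "awesome phrases"))     # blanks required on both sides

# Point 4: exact matching with anchors. An unescaped "." matches any
# character, so escape the literal text when that matters.
loose = r"^Exact title.$"
strict = "^" + re.escape("Exact title.") + "$"
print(regex_hit(loose, "Exact title!"))
print(regex_hit(strict, "Exact title!"))
print(regex_hit(strict, "Exact title."))
```

Note that the blank-delimited form does not match a phrase at the very start or end of a field, since it requires a blank on both sides.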

Please holler if this change could badly break some of your workflows.

References:
[1] http://invenio-demo.cern.ch/help/search-guide#words-vs-phrases
[2] http://invenio-software.org/ticket/137
[3] https://github.com/inveniosoftware/invenio/blob/master/modules/websearch/lib/search_engine_config.py#L33

Best regards
-- 
Tibor Simko


Re: RFC unifying phrase search behaviour

2014-02-24 Thread Alexander Wagner

On 24.02.2014 11:30, Tibor Simko wrote:

Hi!


People don't easily distinguish between the following queries:

title:'some phrase'


substring


title:"some phrase"


exact search

[...]

245:'some phrase'
245:"some phrase"

so that single-quoted and double-quoted phrase queries would always
return the same result.


Which is then an exact match, right? So to get '' matches
one would use *bla*, right?


What this change means for you:

1. The end users can use single-quoted or double-quoted queries to
express phrase search, in all indexes.  There would be no difference.

2. The phrase search would be done by default via word pair matching,
unless indexes are tokenised in a special manner (e.g. exact author
name) or unless users search inside physical MARC tags (when no word
pair index exists).

3. If you have relied on partial phrase matching, please switch to
regular expression queries like:

   245:/some phrase/
   245:/[[:blank:]]some phrase[[:blank:]]/



This should be 245:*some phrase*...


4. If you have relied on exact phrase matching, please switch to
regular expression queries like:

   245:/^Exact title.$/


This should be 245:"Exact title."

Sorry, if I ask here.

If I get this correctly, every /exact/ search, in old world
"bla" (no substring) would be a regular expression now, in
this new scheme, right?

This would IMHO /not/ be sensible at all.

First of all, if I place "bla" explicitly in quotes I /expect/
it to be an exact match and not a substring, so it is
counterintuitive. See G (and friends): the only way to switch
off their intelligence is to put things explicitly in
quotes.

Secondly, it would mean that all our ID searches which are
ID:(src)Number-type things end up in really /expensive/
regexp searches.

I.e. we would regexp something like this:

http://juser.fz-juelich.de/search?p=%28collection%3A%22VDB%22+and+web%3A%222013%22%29+and+%28id%3A%22WOS%3A000%2A%22+or+sid%3A%22StatID%3A%28DE-HGF%290100%22+or+sid%3A%22StatID%3A%28DE-HGF%290110%22+or+sid%3A%22StatID%3A%28DE-HGF%290111%22+or+sid%3A%22StatID%3A%28DE-HGF%290120%22+or+sid%3A%22StatID%3A%28DE-HGF%290130%22%29+and+pof%3A%22G%3A%28DE-HGF%29POF2-110%22
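
For readers who want to see the actual search pattern behind this URL, the p parameter can be decoded with Python's standard library. The helper name search_pattern is made up for this sketch; the URL is the one quoted above, split across string literals only for readability.

```python
from urllib.parse import parse_qs, urlsplit

URL = (
    "http://juser.fz-juelich.de/search?p="
    "%28collection%3A%22VDB%22+and+web%3A%222013%22%29"
    "+and+%28id%3A%22WOS%3A000%2A%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290100%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290110%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290111%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290120%22"
    "+or+sid%3A%22StatID%3A%28DE-HGF%290130%22%29"
    "+and+pof%3A%22G%3A%28DE-HGF%29POF2-110%22"
)


def search_pattern(url):
    # parse_qs undoes both the %XX escapes and the "+" for spaces.
    return parse_qs(urlsplit(url).query)["p"][0]


pattern = search_pattern(URL)
print(pattern)
```

The decoded pattern shows that each identifier is a double-quoted phrase query such as sid:"StatID:(DE-HGF)0100".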


Please holler if this change could badly break some of
your workflows.


If I get it correctly, it breaks almost all our bean
counting. IDs are something like

  sid:(DE-HGF)1

or

  sid:(DE-HGF)11

if you map sid:"(DE-HGF)1" to the old 'sid:(DE-HGF)1' it
matches also sid:"(DE-HGF)11", which is wrong and not
intended.

So, if one wants to unify quotes (I agree that
distinguishing '' vs "" is difficult to explain and
especially needs explanation) then one should unify it to
use quotes always for exact matches. One could then have
substring search by either leaving out the quotes or putting
explicit * operators like

  *bla*

i.e. let the '' behave like the "" in invenio logic but not
to have "" call sub string matches.

This is also something I can explain to the Normal User(tm),
while I think your regexps above are a bit beyond their
common language ;)
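
The alternative suggested here (quotes always mean an exact match, with explicit * for substrings) could be sketched as follows. This is only one reading of the suggestion; the helper name wildcard_match and the case-insensitive comparison are assumptions for illustration.

```python
import re


def wildcard_match(field, pattern):
    # Treat "*" as the only wildcard; everything else is literal text.
    # Without any "*", this is a plain exact match on the whole value.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, field, flags=re.IGNORECASE) is not None


# Quoted ID: exact, so (DE-HGF)1 does not pick up (DE-HGF)11.
print(wildcard_match("(DE-HGF)1", "(DE-HGF)1"))
print(wildcard_match("(DE-HGF)11", "(DE-HGF)1"))

# Explicit stars turn the same syntax into a substring search.
print(wildcard_match("The Kreutzer Sonata", "*reutzer son*"))
print(wildcard_match("The Kreutzer Sonata", "Kreutzer Sonata"))
```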

--

Kind regards,

Alexander Wagner
Scientific Services / Scientific Publishing
Central Library
52425 Juelich

mail : a.wag...@fz-juelich.de
phone: +49 2461 61-1586
Fax  : +49 2461 61-6103
http://www.fz-juelich.de/zb/wp



