Re: RFC unifying phrase search behaviour
Hi,

Alexander Wagner a.wag...@fz-juelich.de wrote:
> On 24.02.2014 11:30, Tibor Simko wrote:
>
> Hi!
>
>> People don't easily distinguish between the following queries:
>>
>>   title:'some phrase'    substring
>>   title:"some phrase"    exact search
>
> [...]

Once more, I agree with Alexander. The whole reply.

Thanks,
Ferran
Re: RFC unifying phrase search behaviour
On Mon, 24 Feb 2014, Alexander Wagner wrote:
>>   245:'some phrase'
>>   245:"some phrase"
>>
>> so that single-quoted and double-quoted phrase queries would always
>> return the same result.
>
> Which is then an exact match, right? So to get '' matches one would
> use *bla*, right?

No, actually, not an exact match, but a word pair match. Here is a possibly clearer example. Consider the following record:

  245 $a The Kreutzer Sonata

When users type:

  245:'Kreutzer Sonata'
  245:"Kreutzer Sonata"

then the record would be returned. When users type:

  245:'reutzer son'
  245:"reutzer son"

then the record won't be returned; people would have to type:

  245:/reutzer son/

in order to get a substring match. In summary:

  +-----------------------+-------------------+--------------------+
  | QUERY                 | CURRENT BEHAVIOUR | PROPOSED BEHAVIOUR |
  +-----------------------+-------------------+--------------------+
  | 245:'Kreutzer Sonata' | hit               | hit                |
  | 245:"Kreutzer Sonata" | miss              | hit                |
  | 245:'reutzer son'     | hit               | miss               |
  | 245:"reutzer son"     | miss              | miss               |
  | 245:/reutzer son/     | hit               | hit                |
  +-----------------------+-------------------+--------------------+

Note that the proposed behaviour is already the case for some logical indexes such as title in the Invenio v1.1 release series and above. The current RFC proposes to widen its scope to cover all indexes, including physical MARC queries.

> If I get it correctly, it breaks almost all our bean counting. IDs
> are something like sid:(DE-HGF)1 or sid:(DE-HGF)11 if you map
> "sid:(DE-HGF)1" to the old 'sid:(DE-HGF)1' it matches also
> sid:(DE-HGF)11, which is wrong and not intended.

Nope, it would not be mapped that way, see above. The ID matching would remain safe.

Best regards
-- 
Tibor Simko
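The summary table above can be reproduced with a small Python sketch. It is only an illustration, not Invenio's implementation: the `words` tokeniser, the case folding, and the three helper functions are all hypothetical stand-ins, and a real word pair index may tokenise identifiers differently.

```python
import re

def words(text):
    """Lowercase word tokens; an assumed stand-in for real index tokenisation."""
    return re.findall(r"\w+", text.lower())

def word_pair_match(field, phrase):
    """Proposed behaviour (sketch): the phrase's words occur consecutively
    and as whole words in the field."""
    f, p = words(field), words(phrase)
    if not p:
        return False
    return any(f[i:i + len(p)] == p for i in range(len(f) - len(p) + 1))

def substring_match(field, phrase):
    """Old single-quoted behaviour (sketch): plain substring test."""
    return phrase.lower() in field.lower()

def exact_match(field, phrase):
    """Old double-quoted behaviour (sketch): the whole field must match."""
    return field.lower() == phrase.lower()

title = "The Kreutzer Sonata"
for phrase in ("Kreutzer Sonata", "reutzer son"):
    print(phrase,
          "substring:", substring_match(title, phrase),
          "exact:", exact_match(title, phrase),
          "word pair:", word_pair_match(title, phrase))
```

Under the same assumed tokenisation, `word_pair_match("(DE-HGF)11", "(DE-HGF)1")` is false while `substring_match` on the same arguments is true, which is the distinction the ID question turns on.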
RFC unifying phrase search behaviour
Hi:

For the forthcoming release of Invenio v1.2, we'd like to change one long-standing feature related to phrase queries.

People don't easily distinguish between the following queries:

  title:'some phrase'
  title:"some phrase"

which is why in 2012 we introduced a configuration option that makes it possible to specify, for each and every index, whether the difference between single-quoted and double-quoted expressions should be respected. By default, we have killed the difference in the most exposed indexes such as global, title, abstract, but we have kept it for MARC queries in order not to break existing cataloguing workflows.

We'd now like to extend this to all indexes by default, including MARC queries like:

  245:'some phrase'
  245:"some phrase"

so that single-quoted and double-quoted phrase queries would always return the same result.

What this change means for you:

1. The end users can use single-quoted or double-quoted queries to
   express phrase search, in all indexes. There would be no
   difference.

2. The phrase search would be done by default via word pair matching,
   unless indexes are tokenised in a special manner (e.g. exact author
   name) or unless users search inside physical MARC tags (when no
   word pair index exists).

3. If you have relied on partial phrase matching, please switch to
   regular expression queries like:

     245:/some phrase/
     245:/[[:blank:]]some phrase[[:blank:]]/

4. If you have relied on exact phrase matching, please switch to
   regular expression queries like:

     245:/^Exact title.$/

Please holler if this change could badly break some of your workflows.

References:

[1] http://invenio-demo.cern.ch/help/search-guide#words-vs-phrases
[2] http://invenio-software.org/ticket/137
[3] https://github.com/inveniosoftware/invenio/blob/master/modules/websearch/lib/search_engine_config.py#L33

Best regards
-- 
Tibor Simko
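The regular expression patterns in items 3 and 4 above can be tried out with Python's `re` module. This is a rough sketch under stated assumptions: `regex_query_hits` is a hypothetical helper, Python's regex dialect is not necessarily the one Invenio evaluates, and since Python's `re` has no `[[:blank:]]` class, `[ \t]` is used in its place.

```python
import re

def regex_query_hits(field_value, pattern):
    """Emulate an unanchored regex query such as 245:/pattern/ (assumed semantics)."""
    return re.search(pattern, field_value) is not None

title = "The Kreutzer Sonata"

# Item 3: partial phrase matching, anywhere inside the field value.
partial = regex_query_hits(title, r"reutzer Son")

# Item 3, padded form: requires a blank on BOTH sides of the phrase,
# so a phrase at the very end of the field does not match.
padded_end = regex_query_hits(title, r"[ \t]Kreutzer Sonata[ \t]")
padded_mid = regex_query_hits("The Kreutzer Sonata in A", r"[ \t]Kreutzer Sonata[ \t]")

# Item 4: exact matching via ^ and $ anchors.
anchored = regex_query_hits(title, r"^The Kreutzer Sonata$")

# An unescaped dot matches any character; escape it for a literal full stop.
loose = regex_query_hits("Exact titleX", r"^Exact title.$")
strict = regex_query_hits("Exact titleX", r"^Exact title\.$")

print(partial, padded_end, padded_mid, anchored, loose, strict)
```

The examples keep the letter case identical between pattern and value, because whether such queries are case-sensitive in a given installation is not stated here.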
Re: RFC unifying phrase search behaviour
On 24.02.2014 11:30, Tibor Simko wrote:

Hi!

> People don't easily distinguish between the following queries:
>
>   title:'some phrase'    substring
>   title:"some phrase"    exact search
>
> [...]
>
>   245:'some phrase'
>   245:"some phrase"
>
> so that single-quoted and double-quoted phrase queries would always
> return the same result.

Which is then an exact match, right? So to get '' matches one would use *bla*, right?

> What this change means for you:
>
> 1. The end users can use single-quoted or double-quoted queries to
>    express phrase search, in all indexes. There would be no
>    difference.
>
> 2. The phrase search would be done by default via word pair matching,
>    unless indexes are tokenised in a special manner (e.g. exact author
>    name) or unless users search inside physical MARC tags (when no
>    word pair index exists).
>
> 3. If you have relied on partial phrase matching, please switch to
>    regular expression queries like:
>
>      245:/some phrase/
>      245:/[[:blank:]]some phrase[[:blank:]]/

This should be 245:*some phrase*...

> 4. If you have relied on exact phrase matching, please switch to
>    regular expression queries like:
>
>      245:/^Exact title.$/

This should be 245:"Exact title."

Sorry if I ask here. If I get this correctly, every /exact/ search, i.e. "bla" in the old world (no substring), would be a regular expression in this new scheme, right? This would IMHO /not/ be sensible at all.

First of all, if I place "bla" explicitly in quotes I /expect/ it to be an exact match and not a substring, so it is counterintuitive. See G (and friends): the only way to switch off their intelligence is to put things explicitly in quotes.

Secondly, it would mean that all our ID searches, which are ID:(src)Number-type things, end up in really /expensive/ regexp searches. I.e. we would regexp something like this:

  http://juser.fz-juelich.de/search?p=%28collection%3A%22VDB%22+and+web%3A%222013%22%29+and+%28id%3A%22WOS%3A000%2A%22+or+sid%3A%22StatID%3A%28DE-HGF%290100%22+or+sid%3A%22StatID%3A%28DE-HGF%290110%22+or+sid%3A%22StatID%3A%28DE-HGF%290111%22+or+sid%3A%22StatID%3A%28DE-HGF%290120%22+or+sid%3A%22StatID%3A%28DE-HGF%290130%22%29+and+pof%3A%22G%3A%28DE-HGF%29POF2-110%22

> Please holler if this change could badly break some of your workflows.

If I get it correctly, it breaks almost all our bean counting. IDs are something like sid:(DE-HGF)1 or sid:(DE-HGF)11. If you map "sid:(DE-HGF)1" to the old 'sid:(DE-HGF)1', it also matches sid:(DE-HGF)11, which is wrong and not intended.

So, if one wants to unify quotes (I agree that distinguishing '' vs "" is difficult to explain and especially needs explanation), then one should unify it to use quotes always for exact matches. One could then have substring search by either leaving out the quotes or putting explicit * operators like *bla*, i.e. let '' behave like "" does in Invenio's logic, rather than having "" perform substring matches. This is also something I can explain to the Normal User(tm), while I think your regexps above are a bit beyond their common language ;)

-- 
Kind regards,

Alexander Wagner
Scientific Services / Scientific Publishing
Central Library
52425 Juelich

mail : a.wag...@fz-juelich.de
phone: +49 2461 61-1586
Fax  : +49 2461 61-6103
http://www.fz-juelich.de/zb/wp