#688: generalization of second-order operators in SEQP
------------------------+----------------------
Reporter: valkyrie | Owner: valkyrie
Type: task | Status: new
Priority: major | Milestone:
Component: WebSearch | Version:
Resolution: | Keywords: syntax
------------------------+----------------------
Comment (by tbrooks):
The use case, and immediate driver for this ticket, as described above, is
find cc US
This search, in SPIRES, in the HEP collection, means: find all papers
from institutions whose institution records have a country code = US.
This is basically a join, and the join is on 100u/700u (HEP) <-> 110u
(inst).
As Valkyrie and I mused, we could do this at indexing time, but it is
tricky, since the indexer needs to update index entries for records that
have not changed (i.e. the inst. record connected to this paper has
changed and the paper itself hasn't, yet the paper's index entry needs to
be updated). To me, this seems isomorphic to the citation case, where
the other collection serves as the citation dictionary.
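To make the indexing-time difficulty concrete, here is a minimal Python
sketch of the reverse lookup the indexer would need: when an inst. record
changes, find the unchanged papers whose index entries are now stale. The
data, ids and function name are all made up for illustration; this is not
Invenio code.

```python
# Toy data: HEP paper id -> affiliation strings (100u/700u values).
# All ids and values here are invented for illustration.
PAPER_AFFILIATIONS = {
    1: ["SLAC", "CERN"],
    2: ["CERN"],
    3: ["Fermilab"],
}


def papers_to_reindex(changed_inst_110u, paper_affiliations):
    """Return ids of papers whose index entries depend on the changed inst."""
    return sorted(
        paper_id
        for paper_id, affiliations in paper_affiliations.items()
        if changed_inst_110u in affiliations
    )


# The CERN inst. record changed; none of the papers did.
print(papers_to_reindex("CERN", PAPER_AFFILIATIONS))
```

Every paper returned here would have to be re-indexed even though its own
record was never touched.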
Additionally complicating the indexing time solution is the fact that once
you understand what keys to join on, you should really have access to the
full set of indexes in the other collection. I.e. once I know that
100u<->110u connects HEP to Inst, I should be able to access all indexes
from the other collection via this relationship. This argues for a
configuration + searching solution rather than an indexing-time one,
especially since the use cases here are usually rare (i.e. admin use,
occasional user use), so speed is not as crucial as flexibility and power.
As regards the syntax:
Your example: (author:"Doe, J" in "ATLAS Notes")
I'm not sure what it means. This would be in HEP? Searching for ATLAS
notes written by Doe, J? But how are these notes connected to HEP? What
is the relation on which we are joining the collections? I guess I don't
see what this searcher is expecting to see, so I'm not sure I can
understand whether the syntax makes sense. For me these 2nd order
extensions are only reasonable for "authority file" type collections,
where there is a sensible mapping from one collection to another index in
the other. For "ATLAS Notes" and similar collections within HEP, or with
a data model similar to HEP's, I think the joint/combined searching would
be handled very differently.
Similarly, author:doe in (refersto:author:ellis and muon) doesn't make
sense to me, as "in" here seems identical to "and", since these are all in
HEP collections.
For example, there are similar use cases that will come from conferences
(find all papers presented at conferences in France), HEPNames (find all
papers by undergraduates from Case Western), experiments, etc.
My proposal, again, is to handle the above cases by defining in a config
file which collections can provide second-order search, and what fields
are joined in these cases.
A sample config file would look like:
HEP:affiliation::Institutions::110_u
(meaning in HEP one searches the affiliation index using the 110_u value
from the inst record)
HEP:cnum::Conferences::<whatever>
HEP:author::HEPNames::100_i (etc etc)
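As a rough idea of how little is needed to read such a file, here is a
Python sketch that parses lines of the form shown above. The JoinRule
container and its field names are my own invention, and the format is
only the one proposed in this comment, not an existing Invenio config.

```python
from collections import namedtuple

# Hypothetical container for one config line; the field names are invented.
JoinRule = namedtuple(
    "JoinRule", "collection index other_collection other_field"
)


def parse_join_rule(line):
    """Parse 'HEP:affiliation::Institutions::110_u' into a JoinRule."""
    left, other_collection, other_field = line.strip().split("::")
    collection, index = left.split(":", 1)
    return JoinRule(collection, index, other_collection, other_field)


SAMPLE = """\
HEP:affiliation::Institutions::110_u
HEP:author::HEPNames::100_i
"""

RULES = [parse_join_rule(line) for line in SAMPLE.splitlines() if line.strip()]
for rule in RULES:
    print(rule)
```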
Then we can search using
Institutions:<any inst index>:<search term> and bring that back to HEP via
the above relation.
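In other words, the search runs in two steps. Here is a Python sketch of
find cc US done this way, using toy in-memory data; the record contents,
the "cc" index name and the function name are assumptions for
illustration, not real Invenio structures.

```python
# Toy inst. records: the 110_u join value plus a country code index "cc".
INSTITUTIONS = [
    {"110_u": "SLAC", "cc": "US"},
    {"110_u": "Fermilab", "cc": "US"},
    {"110_u": "CERN", "cc": "CH"},
]

# Toy HEP affiliation index: affiliation value -> set of paper ids.
HEP_AFFILIATION_INDEX = {
    "SLAC": {1},
    "CERN": {1, 2},
    "Fermilab": {3},
}


def second_order_search(inst_index, term):
    """Search inst. on any of its indexes, then bring the hits back to HEP."""
    # Step 1: ordinary search in the other collection (Institutions).
    join_values = [
        rec["110_u"] for rec in INSTITUTIONS if rec.get(inst_index) == term
    ]
    # Step 2: look up each join value in the HEP affiliation index.
    hits = set()
    for value in join_values:
        hits |= HEP_AFFILIATION_INDEX.get(value, set())
    return hits


# Roughly "Institutions:cc:US" brought back to HEP.
print(sorted(second_order_search("cc", "US")))
```

Since step 1 is an ordinary search in inst., any inst. index can be used
there without touching the HEP indexes.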
The point of the syntax is that <any inst index> should be prefixed in
such a way that we see it is meant to be used in inst. However,
"<any inst index>:<search term> in inst" is also reasonable, just a bit
more SPIRES-y, so it doesn't fit as well with Invenio syntax. I'm not sure
which one is easier to parse, but I liked the analogy to refersto:, in that
refersto: invokes a similar second-order operation/indirect search.
Whatever is easy for the parser and reasonably general is fine with me.
OK, regardless of these general concerns, we need to implement something
for find cc US. We could hack searching as a one-off, or indexing as a
one-off; however, I think making it more general would be good.
--
Ticket URL: <http://invenio-software.org/ticket/688#comment:3>
Invenio <http://invenio-software.org>