Hi all,
Firstly, I apologise for the length of this email but I need to
describe properly what I'm doing before I get to the problem!
I'm working on a project just now which requires the ability to store
and search on temporal coverage data, i.e. a field which specifies a
date range during which a certain event took place.
I hunted around for a few days and couldn't find anything which seemed
to fit, so I had a go at writing my own field type based on
solr.PointType. It's used as follows:
schema.xml
<fieldType name="temporal" class="solr.TemporalCoverage"
dimension="2" subFieldSuffix="_i"/>
<field name="daterange" type="temporal" indexed="true" stored="true"
multiValued="true"/>
data.xml
<add>
<doc>
...
<field name="daterange">1940,1945</field>
</doc>
</add>
Internally, this gets stored as:
<arr name="daterange"><str>1940,1945</str></arr>
<int name="daterange_0_i">19400000</int>
<int name="daterange_1_i">19450000</int>
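The conversion from each endpoint to its integer subfield value can be
sketched roughly as below. This is Python for brevity, not the actual
field type code; the function names are illustrative only, and the rule
(right-pad a YYYY, YYYYMM or YYYYMMDD string with zeros to eight digits)
is an assumption inferred from the example values above.

```python
def to_subfield_int(date_str):
    """Right-pad a YYYY, YYYYMM or YYYYMMDD string with zeros to 8 digits.

    Assumed rule, inferred from the examples: "1940" -> 19400000.
    """
    s = date_str.strip()
    if not s.isdigit() or len(s) not in (4, 6, 8):
        raise ValueError(f"unsupported date format: {date_str!r}")
    return int(s.ljust(8, "0"))


def split_range(value):
    """Split a 'start,end' field value into the two integer subfield values."""
    start, end = value.split(",")
    return to_subfield_int(start), to_subfield_int(end)


print(split_range("1940,1945"))          # (19400000, 19450000)
print(split_range("19820402,19820614"))  # (19820402, 19820614)
```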
In due course, I'll declare the subfields as a proper date type, but
in the meantime, this works absolutely fine. I can search for an
individual date and Solr will check (queryDate > daterange_0 AND
queryDate < daterange_1) and the correct documents are returned. My
code also allows the user to input a date range in the query, but I
won't complicate matters with that just now!
The problem arises when a document has more than one "daterange" field
(imagine a news broadcast which covers a variety of topics and hence
time periods).
A document with two daterange fields:
<doc>
...
<field name="daterange">19820402,19820614</field>
<field name="daterange">1990,2000</field>
</doc>
gets stored internally as:
<arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
<arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
<arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>
In this situation, searching for 1985 should yield zero results, as it
falls within neither date range. However, the above document is
returned in the result set. What Solr is doing is checking that the
queryDate (1985) is greater than *any* of the values in daterange_0
AND less than *any* of the values in daterange_1.
How can I get Solr to respect the positions of each item in the
daterange_0 and _1 arrays? Ideally I'd like the search to use the
following logic, thus preventing the above document from being
returned in a search for 1985:
(queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
(queryDate > daterange_0[1] AND queryDate < daterange_1[1])
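To make the difference concrete, here is a small Python sketch of the
two behaviours, run against the subfield values of the example document
above. It only models the logic; it is not how Solr evaluates queries
internally, and the function names are made up for illustration.

```python
# Subfield values of the example document with two daterange fields.
starts = [19820402, 19900000]  # daterange_0_i
ends = [19820614, 20000000]    # daterange_1_i


def matches_any_value(query, starts, ends):
    """Observed behaviour: each clause may be satisfied by any value."""
    return any(query > s for s in starts) and any(query < e for e in ends)


def matches_pairwise(query, starts, ends):
    """Desired behaviour: both clauses must hold at the same position."""
    return any(s < query < e for s, e in zip(starts, ends))


query = 19850000  # a search for 1985
print(matches_any_value(query, starts, ends))  # True  (false positive)
print(matches_pairwise(query, starts, ends))   # False (what I want)
```

The first function matches because 1985 is later than the first start
value and earlier than the second end value, even though those two
values belong to different ranges.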
Someone else had a very similar problem recently on the mailing list
with a multiValued PointType field but the thread went cold without a
final solution.
While I could filter the results when they get back to my application
layer, it seems like it's not really the right place to do it.
Any help getting Solr to respect the positions of items in arrays
would be very gratefully received.
Many thanks,
Mark