Hi all,

Firstly, I apologise for the length of this email but I need to describe properly what I'm doing before I get to the problem!

I'm working on a project just now which requires the ability to store and search on temporal coverage data, i.e. a field which specifies a date range during which a certain event took place.

I hunted around for a few days and couldn't find anything which seemed to fit, so I had a go at writing my own field type based on solr.PointType. It's used as follows:
  schema.xml
        <fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
        <field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>
  data.xml
        <add>
        <doc>
        ...
        <field name="daterange">1940,1945</field>
        </doc>
        </add>

Internally, this gets stored as:
    <arr name="daterange"><str>1940,1945</str></arr>
    <int name="daterange_0_i">19400000</int>
    <int name="daterange_1_i">19450000</int>

In due course, I'll declare the subfields as a proper date type, but in the meantime, this works absolutely fine. I can search for an individual date and Solr will check (queryDate > daterange_0 AND queryDate < daterange_1), and the correct documents are returned. My code also allows the user to input a date range in the query but I won't complicate matters with that just now!
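
To make the single-range case concrete, here is a minimal Python sketch of the behaviour described above. It is only an illustration: the zero-padding rule is inferred from the stored integers shown earlier, and the names `encode_date` and `in_range` are hypothetical, not part of the actual field type.

```python
def encode_date(text):
    """Right-pad a partial date such as '1940' with zeros to eight digits.

    Assumption: this mirrors how '1940' ends up stored as 19400000.
    """
    return int(text.strip().ljust(8, "0"))


def in_range(query, value):
    """Check a single query date against one 'start,end' field value."""
    start_text, end_text = value.split(",")
    start, end = encode_date(start_text), encode_date(end_text)
    q = encode_date(query)
    # Same strict comparison as in the description:
    # queryDate > daterange_0 AND queryDate < daterange_1
    return start < q < end


print(in_range("1942", "1940,1945"))  # True
print(in_range("1950", "1940,1945"))  # False
```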

The problem arises when a document has more than one "daterange" field (imagine a news broadcast which covers a variety of topics and hence time periods).

A document with two daterange fields
        <doc>
        ...
        <field name="daterange">19820402,19820614</field>
        <field name="daterange">1990,2000</field>
        </doc>
gets stored internally as
        <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
        <arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
        <arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>

In this situation, searching for 1985 should yield zero results, as it falls within neither date range. However, the above document is returned in the result set. What Solr is doing is checking that the queryDate (1985) is greater than *any* of the values in daterange_0 AND less than *any* of the values in daterange_1.
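
The false positive can be reproduced outside Solr with a few lines of Python. This is just a sketch of the "any value" matching described above, using the stored integers from the two-range document; the function name `matches_any_value` is made up for illustration.

```python
# Subfield values as stored for the two-range document
starts = [19820402, 19900000]  # daterange_0_i
ends = [19820614, 20000000]    # daterange_1_i


def matches_any_value(query, starts, ends):
    """Each clause is satisfied by ANY value of its multiValued subfield,
    with no link between the start and end that belong together."""
    return any(query > s for s in starts) and any(query < e for e in ends)


# 1985 is after the 1982 start and before the 2000 end, so it matches
# even though those two values come from different ranges.
print(matches_any_value(19850000, starts, ends))  # True
```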

How can I get Solr to respect the positions of each item in the daterange_0 and _1 arrays? Ideally I'd like the search to use the following logic, thus preventing the above document from being returned in a search for 1985:
        (queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
        (queryDate > daterange_0[1] AND queryDate < daterange_1[1])
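
In Python terms, the position-aware logic I'm after would look something like the sketch below. It only illustrates the desired semantics and says nothing about how Solr could implement it; `matches_pairwise` is a hypothetical name.

```python
def matches_pairwise(query, starts, ends):
    """Compare the query against each (start, end) pair at the same position."""
    return any(s < query < e for s, e in zip(starts, ends))


starts = [19820402, 19900000]  # daterange_0_i
ends = [19820614, 20000000]    # daterange_1_i

print(matches_pairwise(19850000, starts, ends))  # False: in neither range
print(matches_pairwise(19950000, starts, ends))  # True: inside 1990-2000
print(matches_pairwise(19820501, starts, ends))  # True: inside the 1982 range
```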

Someone else had a very similar problem recently on the mailing list with a multiValued PointType field but the thread went cold without a final solution.

While I could filter the results when they get back to my application layer, it seems like it's not really the right place to do it.

Any help getting Solr to respect the positions of items in arrays would be very gratefully received.

Many thanks,
Mark


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
