Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-07 Thread Tricia Williams

Chris Hostetter wrote:

: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect that the ambiguous date must 
be entirely contained within the range being queried on.

If implemented in DateField I guess this behaviour would need to be 
configurable.

(your implication of counting once per day would have pretty weird results 
on faceting by the way)
  
I agree.  It would be possible to have one document hit on a query but 
have hundreds of facet categories with a count of one under this 
scheme.  I'm leaning towards the scenario I described where the document 
would be counted once in an other facet category if it is relevant 
through rounding.

with unambiguous dates, you can have exactly what you want just by being a 
little more verbose when indexing/querying (and someone else can have 
exactly what they want by being equally verbose using slightly different 
options/queries)


in your case: i would suggest that you use two fields: date_low and 
date_high ... when you have an exact date (down to the smallest level of 
granularity you care about) you put the same value in both fields, when 
you have an ambiguous value (like 2001-03) you put the largest value 
possible in date_high and the lowest value possible in date_low (ie: 
date_low:2001-03-01T00:00:00Z  date_high:2001-03-31T23:59:59.999Z) then a 
query for anything *overlapping* the range from feb28 to march 13 would 
be...


+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates.

(someone else who only wants to see matches if the ranges *completely* 
overlap would just swap which end point they queried against which field)
  
We've had a really similar solution in place for range queries for a 
while.  Our current problem is really faceting.


Thanks,
Tricia


Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Chris Hostetter

: My question is why isn't the DateField implementation of ISO 8601 broader
: so that it could include YYYY and YYYY-MM as acceptable date strings?  What

because those would be ambiguous.  if you just indexed field:2001-03 would 
you expect it to match field:[2001-02-28T00:00:00Z TO 
2001-03-13T00:00:00Z] ... what about date faceting, what should the 
counts be if you facet per day?

...your expectations may be different than everyone else's.  by requiring 
that the dates be explicit there is no ambiguity, you are in control of 
the behavior.

: would it take to do so?  Are there any work-arounds for faceting by century,
: year, month without creating new fields in my schema?  The last resort would

you can always just index the first date of whatever block of time (month, 
year, century, etc..) and then facet normally.
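
A minimal sketch of that workaround, in Python for illustration: expand each partial value to the first instant of its block before indexing, then facet with ordinary date faceting. The helper name block_start, the field name date_start, and the accepted input shapes (CC, CCYY, CCYY-MM) are assumptions made for this example; nothing here is provided by DateField itself.

```python
import re

# Accepts a century ("18"), a year ("1867"), or a year-month ("1918-11").
PARTIAL = re.compile(r"(\d{2})(?:(\d{2})(?:-(\d{2}))?)?")


def block_start(value):
    """Return the first instant of the block of time that `value` names."""
    match = PARTIAL.fullmatch(value)
    if match is None:
        raise ValueError("unsupported partial date: %r" % value)
    cc, yy, mm = match.groups()
    # Missing parts default to the start of the block: year 00, month 01.
    return "%s%s-%s-01T00:00:00Z" % (cc, yy or "00", mm or "01")


# Request parameters for faceting per year on the hypothetical date_start
# field, assuming Solr's facet.date.* parameters are available.
facet_params = {
    "facet": "true",
    "facet.date": "date_start",
    "facet.date.start": "1800-01-01T00:00:00Z",
    "facet.date.end": "1900-01-01T00:00:00Z",
    "facet.date.gap": "+1YEAR",
}
```

Note the trade-off discussed elsewhere in this thread: a century-only document lands in the January 1st bucket of its first year, which is exactly the false precision being debated.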


-Hoss



Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Tricia Williams
Thanks for making me think about this a little bit deeper, Hoss.  
Comments in-line.


Chris Hostetter wrote:
because those would be ambiguous.  if you just indexed field:2001-03 would 
you expect it to match field:[2001-02-28T00:00:00Z TO 
2001-03-13T00:00:00Z] ... what about date faceting, what should the 
counts be if you facet per day?
  


I would expect field:2001-03 to be a hit on a partial match such as 
field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my 
expectation would be that field:2001-03 would be counted once per day 
for each day in its range. It would follow that a user looking for 
documents relating to 1919 might also be interested in 1910.  But 
conversely a user looking for documents relating to 1919 might really 
only want documents specifically related to 1919.  Maybe the 
implementation would be smart (or configurable) about precision so that 
it wouldn't be counted when the precision asked to be represented by 
facets had more significant figures than the indexed/stored value.  
Maybe there would be another facet category at each precision for 
others -- the documents that have less precision than the current date 
facet precision.  I'm envisioning a hierarchical system that starts 
general with century with click-throughs drilling down eventually to days.
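
To make the "others" idea concrete, here is a rough sketch in Python of how such counts could be computed outside of Solr. It is not something DateField does; the function name facet_counts, the "other" label, and the use of plain YYYY / YYYY-MM / YYYY-MM-DD strings are all assumptions for illustration.

```python
from collections import Counter

# Number of characters that identify a bucket at each precision.
PRECISION_LENGTH = {"year": 4, "month": 7, "day": 10}


def facet_counts(values, precision, within=""):
    """Count each value once: in its bucket, or in "other" if too coarse."""
    length = PRECISION_LENGTH[precision]
    counts = Counter()
    for value in values:
        if value.startswith(within):
            if len(value) >= length:
                # Precise enough: count in the bucket at this precision.
                counts[value[:length]] += 1
            else:
                # Relevant, but coarser than the requested precision.
                counts["other"] += 1
        elif within.startswith(value):
            # Coarser than the drill-down itself, yet still covers it.
            counts["other"] += 1
    return dict(counts)
```

For example, drilling into 2001 by month with the values 2001, 2001-03, 2001-03-15, 2001-04 and 2002-01 gives two hits for 2001-03, one for 2001-04, and one "other" (the year-only document); the 2002 document is not counted at all.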


...your expectations may be different then everyone elses.  by requiring 
that the dates be explicit there is no ambiguity, you are in control of 
the behavior.
  


I can see your point, but surely there are others out there with 
non-explicit date data?  Does my use case make sense to anyone else?


you can always just index the first date of whatever block of time (month, 
year, century, etc..) and then facet normally.


  
Until a better solution presents itself we've gone the route of creating 
more fields for faceting on different blocks of time.  So fields for 
century, decade, year, month, and day will let us facet on each of these 
time periods as needed.  Documents with dates with less precision will 
not show up in date facets with more precision.  I was hoping there was 
an elegant hack for faceting on prefix of a defined number of characters 
(prefix=*, prefix=**, prefix=***, ...) without having to explicitly 
specify ..., prefix=188, prefix=189, prefix=190, prefix=191, ...
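
For anyone wanting to do the same, the per-precision fields can be derived at index time along these lines. This is only a sketch in Python: the field names (date_century, date_decade, and so on) and the accepted input shapes are hypothetical, not our actual schema.

```python
import re

# Accepts "18", "1867", "1918-11" or "1918-11-11".
PARTIAL = re.compile(r"(\d{2})(?:(\d{2})(?:-(\d{2})(?:-(\d{2}))?)?)?")


def facet_fields(value):
    """Build one facet field per level of precision the value really has."""
    match = PARTIAL.fullmatch(value)
    if match is None:
        raise ValueError("unsupported partial date: %r" % value)
    cc, yy, mm, dd = match.groups()
    fields = {"date_century": cc + "00"}
    if yy is not None:
        year = cc + yy
        fields["date_decade"] = year[:3] + "0"
        fields["date_year"] = year
    if mm is not None:
        fields["date_month"] = "%s-%s" % (year, mm)
    if dd is not None:
        fields["date_day"] = "%s-%s-%s" % (year, mm, dd)
    return fields
```

A century-only value such as 18 produces only date_century, so it never shows up when faceting on date_year or finer, which is the behaviour described above.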


Regards,
Tricia


Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Walter Lewis

On 6 Oct 09, at 5:31 PM, Chris Hostetter wrote:

...your expectations may be different than everyone else's.  by requiring
that the dates be explicit there is no ambiguity, you are in control of
the behavior.


The power of some of the other formulas in ISO 8601 is that you don't  
introduce false levels of precision.  The October 2009 issue of a  
magazine is precisely tagged as 200910 or 2009-10.  It doesn't  
have a day, hour or minute.  Most books come with a copyright year: no  
month, no day ...


In the library/book/periodical world these are a common set of  
expectations.


Walter







Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Chris Hostetter

: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect that the ambiguous date must 
be entirely contained within the range being queried on.

(your implication of counting once per day would have pretty weird results 
on faceting by the way)

with unambiguous dates, you can have exactly what you want just by being a 
little more verbose when indexing/querying (and someone else can have 
exactly what they want by being equally verbose using slightly different 
options/queries)

in your case: i would suggest that you use two fields: date_low and 
date_high ... when you have an exact date (down to the smallest level of 
granularity you care about) you put the same value in both fields, when 
you have an ambiguous value (like 2001-03) you put the largest value 
possible in date_high and the lowest value possible in date_low (ie: 
date_low:2001-03-01T00:00:00Z  date_high:2001-03-31T23:59:59.999Z) then a 
query for anything *overlapping* the range from feb28 to march 13 would 
be...

+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates.

(someone else who only wants to see matches if the ranges *completely* 
overlap would just swap which end point they queried against which field)
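
A minimal sketch of the two-field approach in Python, for illustration only. The helper names are hypothetical; only date_low, date_high and the overlap query come from the message above. The containment query is one reading of "swap which end point", namely that the document's range must lie entirely inside the queried range.

```python
import calendar
import re
from datetime import date


def date_bounds(value):
    """Return (date_low, date_high) for YYYY, YYYY-MM or YYYY-MM-DD."""
    match = re.fullmatch(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?", value)
    if match is None:
        raise ValueError("unsupported date: %r" % value)
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    if day is not None:
        # Exact date (a day is assumed to be the finest granularity):
        # the same value goes into both fields.
        exact = date(year, month, day).isoformat() + "T00:00:00Z"
        return exact, exact
    low_month, high_month = (month, month) if month is not None else (1, 12)
    # monthrange() raises for an invalid month such as 00 or 13.
    last_day = calendar.monthrange(year, high_month)[1]
    low = "%04d-%02d-01T00:00:00Z" % (year, low_month)
    high = "%04d-%02d-%02dT23:59:59.999Z" % (year, high_month, last_day)
    return low, high


def overlap_query(start, end):
    """Match documents whose range overlaps [start, end] at all."""
    return "+date_low:[* TO %s] +date_high:[%s TO *]" % (end, start)


def contained_query(start, end):
    """Match documents whose range lies entirely inside [start, end]."""
    return "+date_low:[%s TO *] +date_high:[* TO %s]" % (start, end)
```

Calling date_bounds on 2001-03 yields the date_low and date_high values from the example above, and overlap_query with the feb28 and march 13 end points reproduces the query shown.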


-Hoss



Why isn't the DateField implementation of ISO 8601 broader?

2009-10-01 Thread Tricia Williams

Hi All,

   I'm working with data that has multiple date precisions most of 
which do not have a time associated with them, rather centuries (like 
1800's),  years (like 1867),  and years/months (like  1918-11).  I'm 
able to sort and search using a workaround where we store the date as a 
string CCYYMM where YYMM are optional.


   I was hoping to be able to tie this into the DateField type so that 
it becomes possible to facet on them without much work and duplication 
of data.  Unfortunately it requires the canonical representation of 
dateTime which means the time part of the string is mandatory.


   My question is why isn't the DateField implementation of ISO 8601 
broader so that it could include YYYY and YYYY-MM as acceptable date 
strings?  What would it take to do so?  Are there any work-arounds for 
faceting by century, year, month without creating new fields in my 
schema?  The last resort would be to create these new fields but I'm 
hoping to leverage the power of the DateField and the trie to replace 
range stuff.


Thanks,
Tricia

Some interesting observations from tinkering with the DateFieldTest:

   * 2003-03-00T00:00:00Z becomes 2003-02-28T00:00:00Z
   * 2008-03-00T00:00:00Z becomes 2008-02-29T00:00:00Z
   * 2003-00-00T00:00:00Z becomes 2002-11-30T00:00:00Z
   * 2000-00-00T00:00:00Z becomes 1999-11-30T00:00:00Z
   * 1979-00-31T00:00:00Z becomes 1978-12-31T00:00:00Z
   * 2005-04-00T00:00:00Z becomes 2005-03-31T00:00:00Z
   * 1850-10-00T00:00:00Z becomes 1850-09-30T00:00:00Z

The rounding /YEAR, /MONTH, etc artificially imposes extra precision 
that the original data wouldn't have.  In any case where months are zero 
weird rounding happens.
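
All seven results above are consistent with lenient calendar arithmetic, where an out-of-range month or day rolls over into the neighbouring period instead of being rejected. Java's SimpleDateFormat and Calendar behave this way in their default lenient mode; that this is the cause here is an assumption on my part. The sketch below emulates the rollover in Python (the function name lenient_date is made up for the example):

```python
from datetime import date, timedelta


def lenient_date(year, month, day):
    """Resolve a date the lenient way: month 0 and day 0 roll backwards."""
    # Month 0 means "one month before January", i.e. December of the
    # previous year, so count months from year 0 and split them again.
    months = year * 12 + (month - 1)
    resolved_year, month_index = divmod(months, 12)
    first_of_month = date(resolved_year, month_index + 1, 1)
    # Day 0 means "one day before the 1st", i.e. the last day of the
    # previous month.
    return first_of_month + timedelta(days=day - 1)
```

With this, lenient_date(2003, 0, 0) first rolls the month back to December 2002 and then the day back to 30 November 2002, matching the third observation in the list.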


Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-01 Thread Lance Norskog
 My question is why isn't the DateField implementation of ISO 8601 broader so 
 that it could include YYYY and YYYY-MM as acceptable date strings?  What would 
 it take to do so?

Nobody ever cared? But yes, you're right, the spurious precision is
annoying. However, there is no fuzzy search for dates so the
precision is always used. Let's say I want to limit it to 19th
century American culture. 1790-1910 are a fairly contiguous sequence
in US history, with a massive break at 1910 for WW1.

 Are there any work-arounds for faceting by century, year, month without 
 creating new fields in my schema?  The last resort would be to create these 
 new fields but I'm hoping to leverage the power of the DateField and the trie 
 to replace range stuff.

There are no workarounds as yet. You do not have to store the
century/year etc. fields, only index them.

Tries do not support faceting yet.

 Some interesting observations from tinkering with the DateFieldTest:
   * 2003-03-00T00:00:00Z becomes 2003-02-28T00:00:00Z

The date parser should blow up with these values!