Hello,

There are a few ways to solve this but no
Date extraction filter I know of. Adding
a hundred fields for each Lucene doc
seems bloated.

First, get your text out of the various
source documents (.doc,.pdf,.html) using
available tools out there described in the
Lucene in Action book.

It sounds like you know Perl, so next
try regexes to pull out dates from the
text using java.util.regex and make sure to
remove extra whitespace.

Put your clean date Strings into a java TreeMap
or TreeSet collection to eliminate duplicates.

Finally, loop thru the collection adding items
to a StringBuffer delimited by commas, then make
one long String (holding all your dates) and add
to the Lucene doc as one Field.Text.

You might be able to set that Field to indexed, but not
stored to save space.

Regards,

Peter W.



On Feb 28, 2007, at 11:22 AM, Aigner, Thomas wrote:

Walt,
        I am no expert, but it sounds like you need to associate many
dates to a single record.
...

Tom


-----Original Message-----
From: Walt Stoneburner [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 28, 2007 2:13 PM
To: java-user@lucene.apache.org
Subject: Re: Soliciting Design Thoughts on Date Searching

...
The issue isn't in using DateRange on a Date Field, but in knowing if
there is some filter that already exists which extracts dates from a
body of text to put into a Date Field.  If not, the DateTool solution
is a helpful step in building my own filter; I just don't want to
reinvent the wheel if it already exists.

Now this is where my personal knowledge of Lucene breaks down.
Assuming I can extract each date from a source's body and convert it
to a usable format, can a Lucene Date Field hold more than one date?
For example, is a strict name/value pair, or can the value be a array
of dates, or can I append additional dates under the same name?

Super generalizing, to break the discussion from a date specific
example, suppose I did this:
document.add( Field.Text( "title", "Learning Perl, Fourth Edition" )
); // real title
document.add( Field.Text( "title", "Camel Book" ) );  // my wife knows
it by the cover

Could I do a search for both the long and short title against the title
field?

If the answer is yes, problem solved!  I'll just pile on a ton of
dates as I find them and add them to the document.  (Note, I could
easily have hundreds.)

for ( Date somedate : allDatesFoundInSource[] ) {
  document.add( Field.Text( "embeddedDates", somedate ) );  // Right
way to do this?
}


If the answer is no, it better illustrates the problem I face:
searching across an arbitrary collection of dates.


Erick, if I've missed something obvious in the archives, I'll happily
accept my public flogging.    Thanks for your help so far.

-wls

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to