Lucene RangeQuery would do for the "time" and "numeric" reqs.
"Mark Mei" <[EMAIL PROTECTED]> wrote: > At the bottom of this email is the sample xml file that we are using today. > We have about 10 million of these. > > We need to know whether Lucene can support the following functionalities. > (1) Each field is searchable and indexable. This is natural in Lucene. > (2) Fields such as STARTTIME and ENDTIME need to be treated as a pair so > that we can apply timestamp operation such as search by data time ranges You should be able to do it with two fields - start/end - and then you have all the flexibility for queries. Conditions can be set on either one of these or on both. In case that both are used, i.e. a doc matches only if its start-to-end range falls within a certain minStart maxEnd range, you would need two open ended range queries (or range filters) - one to apply on the start date and one to apply on the end date, because a single range query values must have the same field name. Also notice that open ended range queries must be created programmatically - current QueryParser does not support this. See using DateTools for saving the time values in a resolution that matches your needs. > (3) Fields such as DMA need to be treated as numerical and be able to use > math operators ( > < =) for those fields. Same comments on range queries / filters apply. Be aware though that comparison is lexicographic, so numeric values should be indexed as strings. See NumberTools. > > We also use Apache Commons Digester to parse the xml files. So we want to > know, can all of the above requirements be supported by combining both > Digester and Lucene together, or do we need other modules in order for us to > support those requirements? > If these functionalities can be supported, please tell us about the effort > involved (ie, do I need to rewrite 90% of Lucene/Digester to include support > for these requirements, or is it more like spending one/two afternoons > extending some classes ? ) Apart from the XML handling (don't know about that), for the other reqs it is just using Lucene's API. > > <DOCUMENT> > <DREREFERENCE>61926433</DREREFERENCE> > <DREDBNAME>News</DREDBNAME> > <SEGMENTID>61829557</SEGMENTID> > <SHOWID>2051460</SHOWID> > <PROGRAMID>21181</PROGRAMID> > <PROGRAMNAME>Action 10 News This Morning</PROGRAMNAME> > <PREFIX>wthi0600</PREFIX> > <STATIONID>903</STATIONID> > <STATIONNAME>WTHI-TV</STATIONNAME> > <AFFILIATEID>17</AFFILIATEID> > <AFFILIATENAME>CBS</AFFILIATENAME> > <MARKETID>141</MARKETID> > <MARKETNAME>Terre Haute</MARKETNAME> > <MEDIATYPE>T</MEDIATYPE> > <DMA>149</DMA> > <SOURCETYPE>CC</SOURCETYPE> > <STARTTIME>2005-07-04 06:00:00</STARTTIME> > <ENDTIME>2005-07-04 07:00:00</ENDTIME> > <STARTMETER>00:42:53</STARTMETER> > <ENDMETER>00:45:02</ENDMETER> > <DREDATE>2006-01-25 00:00:00</DREDATE> > <DRETITLE>At we take you to break with a look at some of the fourth of > July fun going on around the wabash valley today.</DRETITLE> > <DRECONTENT>At we take you to break with a look at some of the fourth of > July fun going on around the wabash valley today. This is action 10 news > this morning on wthi. He's been the US Attorney general for only a few > months. But alberto gonzales may already be in the running for a new job. > And not just any job, either.</DRECONTENT> > </DOCUMENT> --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]