On Dec 13, 2006, at 1:51 PM, Patrick Turcotte wrote:
I would suggest you take a look at exist-db (http://exist-db.org/).
I really doubt eXist can handle 10M XML files. Last time I tried it,
it choked on 20k of them.
Erik
A database for XML documents that support XQuery.
We are using both products here (lucene and exist-db), and for what
you are
looking for, exist-db seems better.
Our documents are far more complex than yours (about 500 different
element
in the structure) and even if we don't have millions, we have more
than 53K
documents.
Once loaded in the database, performance are impressives to find
info on
documents parts (xpath) where no index exists. And for your
structure, you
could even create indexes which would boost performance even more.
Don't hesitate to contact me directly if you have more questions.
Patrick
On 12/13/06, Mark Mei <[EMAIL PROTECTED]> wrote:
At the bottom of this email is the sample xml file that we are using
today.
We have about 10 million of these.
We need to know whether Lucene can support the following
functionalities.
(1) Each field is searchable and indexable.
(2) Fields such as STARTTIME and ENDTIME need to be treated as a
pair so
that we can apply timestamp operation such as search by data time
ranges
(3) Fields such as DMA need to be treated as numerical and be able
to use
math operators ( > < =) for those fields.
We also use Apache Commons Digester to parse the xml files. So we
want to
know, can all of the above requirements be supported by combining
both
Digester and Lucene together, or do we need other modules in order
for us
to
support those requirements?
If these functionalities can be supported, please tell us about
the effort
involved (ie, do I need to rewrite 90% of Lucene/Digester to include
support
for these requirements, or is it more like spending one/two
afternoons
extending some classes ? )
<DOCUMENT>
<DREREFERENCE>61926433</DREREFERENCE>
<DREDBNAME>News</DREDBNAME>
<SEGMENTID>61829557</SEGMENTID>
<SHOWID>2051460</SHOWID>
<PROGRAMID>21181</PROGRAMID>
<PROGRAMNAME>Action 10 News This Morning</PROGRAMNAME>
<PREFIX>wthi0600</PREFIX>
<STATIONID>903</STATIONID>
<STATIONNAME>WTHI-TV</STATIONNAME>
<AFFILIATEID>17</AFFILIATEID>
<AFFILIATENAME>CBS</AFFILIATENAME>
<MARKETID>141</MARKETID>
<MARKETNAME>Terre Haute</MARKETNAME>
<MEDIATYPE>T</MEDIATYPE>
<DMA>149</DMA>
<SOURCETYPE>CC</SOURCETYPE>
<STARTTIME>2005-07-04 06:00:00</STARTTIME>
<ENDTIME>2005-07-04 07:00:00</ENDTIME>
<STARTMETER>00:42:53</STARTMETER>
<ENDMETER>00:45:02</ENDMETER>
<DREDATE>2006-01-25 00:00:00</DREDATE>
<DRETITLE>At we take you to break with a look at some of the
fourth of
July fun going on around the wabash valley today.</DRETITLE>
<DRECONTENT>At we take you to break with a look at some of the
fourth of
July fun going on around the wabash valley today. This is action
10 news
this morning on wthi. He's been the US Attorney general for only a
few
months. But alberto gonzales may already be in the running for a
new job.
And not just any job, either.</DRECONTENT>
</DOCUMENT>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]