retrieve all docs efficiently - just one field

1world1love Tue, 10 Jun 2008 15:36:07 -0700

Greetings all. I have read many posts concerning similar use cases, but I am
still a little hazy on the best way to achieve what I need to do. Here is
the background:

2 million documents with multiple sections, some sections contain structured
data, some unstructured.

We parse the docs and place the structured stuff in oracle where each
section is a table and one master table to relate them all.

We index the unstructured sections with lucene where each section is a
document (meaning a total of about ~30 million documents) with extra fields
including one for the primary key of the master table and then some meta
fields to describe the section - type, date, etc.

For a common use case, say we have a table called demographics with a number
field that represents age (overly simplistic but gets the point across).

So say we want all people over the age of 50 who may have visited Panama:

--
We have our lucene index and we want to search the section text for the word
"panama"

AND

We want to select from the demographics table where age > 50.
--

Now I need to intersect the master table IDs from my lucene hits and my
table results.

I have a java stored procedure that runs the lucene query and creates a
temporary table with a single column where I insert the master id from the
hits of my lucene query. I then can do a join with my structured query
results.

The problem here is obviously the speed of iterating through the hits to
extract the single field that I need.

Notes:
- I must be able to get a full set of results, though I only need the one id
field
- We originally went with Oracle text which was simple, but limited and
quite slow for most queries

I have read a little about the hitcollector class and the fieldselector api,
but I am still not sure how they may help me or even if they can.

I have also tooled around with the idea of using termdocs, but the queries
may get a little complex with various ors/ands/nots, though probably not
spans and so forth.

Any suggestions will be greatly apreciated.

Thanks,

--
View this message in context:
http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17766268.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

retrieve all docs efficiently - just one field

Reply via email to