Re: Lucene index/document architecture

Erick Erickson Sat, 27 Jan 2007 09:20:01 -0800

I put in 1TB as a number because I thought it would surely be bigger than
anything you intended to put in your database. And you reply with 100 times
that size <G>.....


The index I'm working with now is 5GB, so I have no wisdom to offer you at
all about how to scale to 100TB. You should probably infer from Otis' reply
that, off the top of our heads, we don't know of anybody who's tried.

Good Luck!
Erick

On 1/27/07, Joost Schouten <[EMAIL PROTECTED]> wrote:

Erick, Otis,

Thank you for your help. I will work with a single index and parent fields.
It's hard to say exactly how much raw data I will index as this differs per
client. But I guess right now I'm more looking at 1G (contents of a
non-CLOB/BLOB DB). But one client is thinking of throwing their entire 100T
file system in it. Not quite sure how to handle that yet. Should I have a
different architecture with 100T compared to 1G?

Thanks,
Joost Schouten
Director

JS Portal
Dasstraat 21
2623CB Delft
the Netherlands
P: +31 6 160 160 14
E: [EMAIL PROTECTED]
W: www.jsportal.com

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 27, 2007 1:30 PM
To: java-user@lucene.apache.org
Subject: Re: lucense index/document architecture

To steal a phrase from Mr. Hatcher... it depends <G>. I'd try keeping it all
in one index at the start until you get some clue how big the index will
eventually grow and whether your search performance is acceptable. Do you
have any idea how much raw data you're going to ask the index to hold? 1M?
1G? 1T?

But it's simple enough to do what you want: just include a field in each
document, let's say Company. Your queries can then search only the documents
belonging to a single company by including a
"+company:companyyoucareabout" clause, or search all documents by leaving
that clause off.
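
As a rough illustration of the idea above, here is a toy in-memory model
written against the plain JDK. It is not Lucene's API: the class
CompanyFilterSketch, its methods, and the sample data are all hypothetical
names made up for this sketch. It only shows how one optional, required
clause on a "company" field narrows a search over a single shared index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a single shared index; NOT Lucene's API. */
final class CompanyFilterSketch {

    /** Builds a document from alternating field-name/value arguments. */
    static Map<String, String> doc(String... kv) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            d.put(kv[i], kv[i + 1]);
        }
        return d;
    }

    /** Hypothetical sample data: three documents owned by two companies. */
    static List<Map<String, String>> sampleDocs() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("company", "acme", "contents", "quarterly report"));
        docs.add(doc("company", "globex", "contents", "annual report"));
        docs.add(doc("company", "acme", "contents", "meeting notes"));
        return docs;
    }

    /**
     * Returns documents whose "contents" field contains the term.
     * A non-null company behaves like a required "+company:..." clause;
     * passing null leaves the clause off and searches every document.
     */
    static List<Map<String, String>> search(
            List<Map<String, String>> docs, String term, String company) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> d : docs) {
            boolean matchesTerm = d.getOrDefault("contents", "").contains(term);
            boolean matchesCompany =
                    company == null || company.equals(d.get("company"));
            if (matchesTerm && matchesCompany) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

Calling search(docs, "report", null) looks across both companies, while
search(docs, "report", "acme") restricts the same query to one. In Lucene
itself this would roughly correspond to adding a required term clause on
the company field to whatever query the user typed.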

Do be aware, when you're doing performance testing, that the first query,
particularly when sorting, takes significantly longer, since Lucene builds
up some internal caches and you pay that penalty the first time through.
Various strategies exist for pre-warming the searcher by firing some canned
queries at the search engine as the server comes up...
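
The warm-up idea can be sketched with a toy searcher that builds a sort
cache lazily. Again this is plain JDK code, not Lucene's API; WarmupSketch,
searchSorted, warm and the cacheBuilds counter are hypothetical names used
only to make the first-query cost visible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy searcher with a lazily built sort cache; NOT Lucene's API. */
final class WarmupSketch {
    private final List<String> titles;
    private List<String> sortedCache;  // built by the first sorted query
    int cacheBuilds = 0;               // how many times the costly step ran

    WarmupSketch(List<String> titles) {
        this.titles = titles;
    }

    /** Returns the titles containing the term, in sorted order. */
    List<String> searchSorted(String term) {
        if (sortedCache == null) {
            // Only the first caller pays for building the cache.
            sortedCache = new ArrayList<>(titles);
            sortedCache.sort(Comparator.naturalOrder());
            cacheBuilds++;
        }
        List<String> hits = new ArrayList<>();
        for (String t : sortedCache) {
            if (t.contains(term)) {
                hits.add(t);
            }
        }
        return hits;
    }

    /** Fires canned queries at startup so real users hit a warm cache. */
    void warm(List<String> cannedTerms) {
        for (String term : cannedTerms) {
            searchSorted(term);
        }
    }
}
```

If warm(...) is called once while the server starts, the first real
searchSorted(...) call finds the cache already built. When benchmarking,
that is also a reason to discard the timing of the very first query.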

If you're a database guy, you might not appreciate one thing that was hard
for me to understand: all documents in an index do NOT have to have the same
fields. In fact, your index could theoretically have no two documents with
any field in common <G>. If you're used to thinking about static table
definitions in a database, this can take a while to get used to.
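
To make the "no fixed schema" point concrete, here is another small JDK-only
sketch (not Lucene's API; SchemaFreeSketch and its methods are hypothetical).
Each document is just a bag of fields, and the three sample documents have
no field in common.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy index whose documents share no schema; NOT Lucene's API. */
final class SchemaFreeSketch {

    /** Builds a document from alternating field-name/value arguments. */
    static Map<String, String> doc(String... kv) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            d.put(kv[i], kv[i + 1]);
        }
        return d;
    }

    /** Hypothetical sample data: no two documents share a field name. */
    static List<Map<String, String>> sampleDocs() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("title", "Invoice 17", "amount", "250"));
        docs.add(doc("author", "joost", "body", "architecture notes"));
        docs.add(doc("filename", "logo.png"));
        return docs;
    }

    /** Exact-match lookup on one field; documents lacking it never match. */
    static List<Map<String, String>> termQuery(
            List<Map<String, String>> docs, String field, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

A query on "author" simply never sees the documents that lack that field;
nothing forces every document to carry it, unlike a column in a table.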

Hope this helps
Erick

On 1/26/07, Joost Schouten <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I'm setting up Lucene to work with our webapp to index a database. My db
> holds files which can belong to a user or a company or both. I want the
> option for my users to search across all content, but also search within
> the files for one user or company. What is the best architecture approach
> for this? Do you add a field to the document with the parentIds, do you
> make a different index for each user/company (can be 1000s) or is there a
> different solution altogether?
>
> Thank you,
> Joost
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
