The problem you may face with such large documents is that there
is a high probability that most terms will be present in all
documents.

So a search will return a lot of documents (if you need to
retrieve the full text, that will take a while), but the bigger problem is
usability: what will a user do when, out of 100 000 documents, a
search narrows the results only to 90 000, and adding more terms
brings that down only to 50 000?
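
One way to soften this, along the lines of Erick's suggestion below to
index in chunks rather than entire documents, is to split each file into
fixed-size token windows and index each window as its own document, so
that a term match points at a smaller unit. A minimal sketch in plain
Java, assuming whitespace tokenization; the Chunker class and the
chunkSize parameter are hypothetical names for illustration, not Lucene
API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Chunker {
    /**
     * Splits text into chunks of at most chunkSize whitespace-separated
     * tokens. Assumption: whitespace tokenization is good enough to size
     * the chunks; a real analyzer may count terms differently.
     */
    public static List<String> chunk(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to index
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, tokens.length);
            // Re-join this window of tokens into one chunk of text.
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
        }
        return chunks;
    }
}
```

Each returned chunk would then be added to the index as a separate
document, together with a stored field identifying the source file, so
hits can be grouped back by file. Whether this helps depends on the
data, so it is worth measuring.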

On Tue, Mar 10, 2009 at 17:03, Amy Zhou <amy.z...@systemware.com> wrote:
> Thanks Erick for your quick response and useful information. I'll try
> bumping up the MaxFieldLength and check the performance. It seems the
> quickest way to handle the issue.
>
> Amy
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, March 10, 2009 9:56 AM
> To: java-user@lucene.apache.org
> Subject: Re: index large size file
>
> Sure there are other options. You could decide to index in chunks
> rather than entire documents. You could decide many things.
> None of which we can recommend unless we have a clue what
> you're really trying to accomplish or whether you're encountering
> a specific problem.
>
> I can say that we've indexed 7,000 *page* documents by bumping the
> MaxFieldLength. The performance is fine. I didn't measure indexing
> performance, but it ran acceptably quickly. Search performance seems
> unaffected, it's mostly dependent upon the overall index size and
> number of unique tokens as far as I can tell.
>
> I suggest you just try it and measure, that's the only way to determine
> whether *your* situation is adversely affected, since nobody can answer
> such a general question without considerably more specifics, and even
> then the answer is a qualified guess.
>
> But if you're *really* asking whether bumping MaxFieldLength does
> something like reserve that much space for every document whether
> or not it needs to, the answer is "no". A MaxFieldLength of 1,000,000,000
> won't use noticeably more resources for a file with 10 tokens than if the
> MaxFieldLength were 100. As far as I know.
>
> Best
> Erick
>
> On Tue, Mar 10, 2009 at 10:47 AM, Amy Zhou <amy.z...@systemware.com> wrote:
>
>> My issue here is that a large file is truncated at the default MaxFieldLength
>> of 10,000 during indexing. The files I index could be 10 MB or larger.
>>
>> My questions are:
>>
>> 1) If I choose MaxFieldLength as UNLIMITED instead of 100,000, what would the
>> performance be?
>> 2) Any other options?
>>
>>
>> -----Original Message-----
>> From: Mark Miller [mailto:markrmil...@gmail.com]
>> Sent: Tuesday, March 10, 2009 9:37 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: index large size file
>>
>> Amy Zhou wrote:
>> > Hi,
>> >
>> > I have a couple of questions about indexing large files. As I understand it,
>> > the default MaxFieldLength is 100,000. In Lucene 2.4, we can set the
>> > MaxFieldLength in the constructor. My questions are:
>> >
>> The default is 10,000.
>> > 1) How's the performance if MaxFieldLength is set to UNLIMITED?
>> >
>> It depends on how long your documents are. It's simply a cutoff -
>> documents longer than n (10,000 by default) terms will be truncated.
>> > 2) Any other options for indexing large size file?
>> >
>> What is the problem you are trying to address? Are you having trouble
>> indexing a very large file? Can you share more details?
>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
