I think the question is strange... Maybe you are wondering about possible
OOM exceptions? I think we can pass Lucene a single document containing a
comma-separated list of "term, term, ..." (a few billion times)... except
for "stored" fields and the TermVectorComponent...

I believe thousands of companies have already indexed millions of
documents with an average size of a few hundred megabytes... There should
not be any hard limits (except the difference between streaming an
InputSource and buffering a whole ByteArray in memory).

But note the difference: 100,000 _unique_ terms vs. a single document
containing 100,000,000,000,000 non-unique terms (while also trying to
store offsets).
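
If offsets are the main worry, a recent Lucene API lets you build a
FieldType whose postings keep only docs and term frequencies, dropping
positions and offsets entirely. A sketch (class and field names are mine,
not from this thread); note it disables phrase queries on the field:

import java.io.Reader;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class MonsterField {
    // Postings keep only doc IDs and term frequencies: no positions and
    // no offsets, so there is no per-occurrence cost for the 10^14
    // non-unique tokens (at the price of phrase queries on this field).
    static final FieldType MONSTER = new FieldType();
    static {
        MONSTER.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        MONSTER.setTokenized(true);
        MONSTER.setStored(false);           // no stored value
        MONSTER.setStoreTermVectors(false); // no term vectors
        MONSTER.freeze();
    }

    static Field bodyField(Reader reader) {
        return new Field("body", reader, MONSTER);
    }
}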

What about "Spell Checker" feature? Is anyone tried to index single
terabytes-like document?

Personally, I have only indexed small (up to 1000 bytes) document fields,
but I believe 500 MB is a very common use case with PDFs. (Which vendors
use Lucene already? Eclipse, to index the Eclipse Help files? Even
Microsoft uses Lucene...)
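
As a concrete sketch of the PDF case: with modern SolrJ you can push a
large PDF through Solr Cell's /update/extract handler (which runs Tika
server-side). The URL, core name, file path, and field names below are
made up, and very large uploads may also need a bigger request-size limit
on the servlet container:

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs").build()) {
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/data/manual.pdf"), "application/pdf");
            req.setParam("literal.id", "manual-1"); // unique key for the doc
            req.setParam("fmap.content", "body");   // map extracted text to "body"
            solr.request(req);
            solr.commit();
        }
    }
}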


Fuad




On 11-06-07 7:02 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>From older (2.4) Lucene days, I once indexed the 23-volume "Encyclopedia
>of Michigan Civil War Volunteers" in a single document/field, so it's
>probably within the realm of possibility at least <G>...
>
>Erick
>
>On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
><otis_gospodne...@yahoo.com> wrote:
>> Hello,
>>
>> What are the biggest document fields that you've ever indexed in Solr,
>> or that you've heard of?  Ah, it must be Tom's HathiTrust. :)
>>
>> I'm asking because I just heard of a case of an index where some
>> documents have a field that can be around 400 MB in size!  I'm curious
>> if anyone has any experience with such monster fields?
>> Crazy?  Yes, sure.
>> Doable?
>>
>> Otis
>> ----
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
