I have indexed around 100 MB of data with 512 MB allocated to the JVM heap,
so that gives you an idea. If every token is the same word in one file,
shouldn't the tokenizer recognize that?

Try using Luke. It helps solve a lot of issues.
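
If it helps, here is roughly the pattern I understand you're using, as a
minimal sketch against the Lucene 2.x-era API (the index path, file name
and class name below are made up for illustration):

import java.io.FileReader;
import java.io.Reader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexLargeTextFile {
    public static void main(String[] args) throws Exception {
        // Open (and create) an index in a local directory.
        IndexWriter writer =
            new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);

        // Tokenized, unstored field built from a Reader,
        // as described in the original post.
        Reader in = new FileReader("big-file.txt");
        Document doc = new Document();
        doc.add(new Field("content", in));

        writer.addDocument(doc);
        writer.close();
    }
}

If indexing still blows up, Luke will at least let you open whatever part of
the index was written and see what actually made it in.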

-
AZ

On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> I can't answer why the same token keeps
> taking up memory, but I've indexed far more than
> 20 MB of data in a single document field, on the
> order of 150 MB. Of course, I allocated 1 GB or so
> to the JVM, so you might try that....
>
> Best
> Erick
>
> On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote:
> >
> > I'm creating a tokenized "content" Field from a plain text file
> > using an InputStreamReader and new Field("content", in);
> >
> > The text file is large, 20 MB, and contains zillions of lines,
> > each with the same 100-character token.
> >
> > That causes an OutOfMemoryError.
> >
> > Given that all tokens are the *same*,
> > why should this cause an OutOfMemoryError?
> > Shouldn't StandardAnalyzer just chug along
> > and note "ho hum, this token is the same"?
> > That shouldn't take too much memory.
> >
> > Or have I missed something?
> >
> >
> >
> >
>
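
As for the heap suggestions above: in practice that just means passing a
larger -Xmx when launching the JVM that does the indexing, for example

java -Xmx1g -cp lucene-core.jar:. IndexLargeTextFile

where the jar name and class are placeholders for whatever your own setup
actually uses.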
