I have indexed around 100 MB of data with 512 MB allocated to the JVM heap, so that gives you an idea. If every token in the file is the same word, shouldn't the tokenizer recognize that?
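For reference, that heap allocation is just a JVM command-line flag; something like the following, where MyIndexer is a stand-in for whatever your own indexing class is called:

    java -Xmx512m -cp lucene-core.jar:. MyIndexer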
Try using Luke. That helps solve lots of issues.

- AZ

On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> I can't answer the question of why the same token
> takes up memory, but I've indexed far more than
> 20 MB of data in a single document field. As in, on the
> order of 150 MB. Of course, I allocated 1 GB or so to the
> JVM, so you might try that....
>
> Best
> Erick
>
> On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote:
> >
> > I'm creating a tokenized "content" Field from a plain text file
> > using an InputStreamReader and new Field("content", in);
> >
> > The text file is large, 20 MB, and contains zillions of lines,
> > each with the same 100-character token.
> >
> > That causes an OutOfMemoryError.
> >
> > Given that all the tokens are the *same*,
> > why should this cause an OutOfMemoryError?
> > Shouldn't StandardAnalyzer just chug along
> > and note "ho hum, this token is the same"?
> > That shouldn't take too much memory.
> >
> > Or have I missed something?
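For anyone following along, here is roughly the setup being described, as a minimal sketch (the index path and file name are made up, and this assumes the Lucene 2.x-era API that was current at the time):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BigFileIndexer {
        public static void main(String[] args) throws Exception {
            // Create a fresh index on disk using StandardAnalyzer.
            IndexWriter writer =
                new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);

            // By default IndexWriter indexes only the first 10,000 tokens
            // of a field; raising the limit indexes the whole 20 MB file,
            // which also means buffering far more tokens in memory.
            writer.setMaxFieldLength(Integer.MAX_VALUE);

            Reader in = new InputStreamReader(new FileInputStream("big.txt"), "UTF-8");

            Document doc = new Document();
            // A Reader-valued Field is tokenized and indexed, but not stored.
            doc.add(new Field("content", in));
            writer.addDocument(doc);

            writer.close();
        }
    }

Note that with the default maxFieldLength left alone, Lucene quietly stops after the first 10,000 tokens of the field, so the memory behavior depends a lot on whether that limit was raised.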