1. No, I'm not sorting. In fact I have only just started reading that section.
2. No. I ran a single search for '1234567' to warm up the searcher, and that alone caused the OOM.
3. After the IndexReader/searcher is created, I do a finalize and print the total memory used, then search for '1234567' as a warm-up, then do another finalize and print the total memory again.

I prepared 3 indexes:
A: from 2009/06 to now, 47G. 449M after the reader is created, 649M after the first search.
B: from 2009/04 to now, 63G. 592M -> 869M
C: from 2009/01 to now, 100+G. 598M -> OOM

As you can see, I haven't done anything yet beyond opening the index and running one search.
I have data going back to the year 2000 to index, and I only have 32-bit Windows machines. That's why I want to build a 'distributed index'.
Another reason: most of the searches will probably be wildcard searches, which are very slow on a single large index.
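For reference, the before/after measurement in point 3 can be sketched roughly as below. This is plain Java with no Lucene calls. `HeapProbe`, `usedHeapBytes` and `measure` are names made up for this sketch, and it assumes 'finalize' above means asking the JVM to collect garbage before reading the numbers. `Runtime.gc()` is only a hint, so the figures are approximate.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Lucene): reports heap in use around a step. */
final class HeapProbe {

    /** Approximate heap in use, read after requesting a garbage collection. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs one step and prints the used heap before and after it. */
    static <T> T measure(String label, Supplier<T> step) {
        long before = usedHeapBytes();
        T result = step.get();
        long after = usedHeapBytes();
        long mb = 1024L * 1024L;
        System.out.printf("%s: %d MB -> %d MB (max heap %d MB)%n",
                label, before / mb, after / mb, Runtime.getRuntime().maxMemory() / mb);
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for "open the reader" or "run the warm-up search":
        // allocate about 32 MB and keep it reachable.
        byte[][] held = measure("allocate", () -> new byte[32][1 << 20]);
        System.out.println("chunks held: " + held.length);
    }
}
```

In the real test the two steps passed to `measure` would be opening the reader and running the warm-up search, so the printed pairs correspond to the "449M -> 649M" style numbers above.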

2009/11/16 Erick Erickson <erickerick...@gmail.com>:
> I confess that I've just skimmed your e-mail, but there's absolutely
> no requirement that the entire index fit in RAM. The fact that your
> index is larger than available RAM isn't the reason you're hitting OOM.
>
> Typical reasons for this are:
> 1> you're sorting on a field with many, many, many unique values. If
> you're sorting on a fine-grained timestamp, this is quite possible.
> 2> You've bumped MAX_BOOLEAN_CLAUSES and are searching
> on, say, one-letter wildcards.
> 3> many other reasons.
>
> I agree with Jacob, jumping into a multi-machine solution without
> understanding the problem in detail may not be your best course.
>
> So, can you tell us more about the conditions under which you hit
> OOM? Maybe with more details we can come up with better solutions.
>
> If you absolutely *must* implement a multi-machine solution, have
> you seen ParallelMultiSearcher?
>
> Best
> Erick
>
> On Mon, Nov 16, 2009 at 2:13 AM, Wenbo Zhao <zha...@gmail.com> wrote:
>
>> Yes, exactly 'distributed'...
>> From maintenance point of view, the 'horizontal' expandable is very
>> important.
>> For my case, the data file is a kind of 'history' file, categorized
>> by date. Once the data file is indexed, it will not change, unless
>> the searching fields changed.
>> Say I make whole ten years data indexed, generated 400G index,
>> requiring 8G ram. When I do backup, I have to backup the entire 400G
>> every time. I need another 8G machine for backup. And 8G is not
>> enough, the index is increasing everyday.
>> Compare to distributed solution, I can split the index by year or by
>> seasons. Say I have 10x40G index. I can easily run 10 jvm process
>> each with 1G heap space, in 3-5 low cost not dedicated x86 machines.
>> Consider the backup, 9 of 10 indexes are old, only need backup once,
>> they won't change. only 1 hot index is changing everyday, so I just
>> backup up to 40G. The spare machine is also very cheap. And the
>> machines are so cheap, I can use VMs to run this, it's more flexible
>> in resource management. As time goes by, I just install new jvm
>> instance when needed. I don't worry about ram and search speed
>> anymore.
>> I do think there should be more bigger cases out there just like mine.
>> The general distributed Lucene will be very useful. It will bring
>> Lucene to more enterprise applications, or more bigger, industry
>> applications.
>>
>>
>> 2009/11/16 Jacob Rhoden <jrho...@unimelb.edu.au>:
>> > Sounds like you may need to have some sort of distributed system, I just
>> > wanted to make sure you were aware of the cost/benefits of just buying a big
>> > 64bit/8Gb ram machine, vs having to not only maintain and power several 32
>> > bit machines, but also maintain and support your now more complicated code.
>> >
>> > I have seen it too many times developers/companies spend so much money in
>> > not just the initial development, but long term support and maintenance that
>> > could have been simplified by just buying a bigger/better more powerful
>> > machine in the first place.
>> >
>> > I am interested to see what other people have to say about how to solve your
>> > problem.
>> >
>> > Best regards,
>> > Jacob
>> >
>> > On 16/11/2009, at 3:39 PM, Wenbo Zhao wrote:
>> >
>> >> My data is categorized by date. About 14M+ docs per month, 37M+ terms.
>> >> When I use 1G heap size to do search of 10 month index, I got OOM.
>> >> The problem is I can't increase heap size in an easy way.
>> >> I have several machines, all 32bit windows, 4G ram.
>> >> And my goal is to index 10 year's data, plus more data every day !
>> >> If I put all of them together, I will need 8G+ ram to run search.
>> >> Maybe another 8G+ ram to run indexwriter.
>> >>
>> >> I think to split large index into smaller indexes and use a group of
>> >> machines to work as one is more flexible and faster compare to one
>> >> huge ram machine.
>> >> Any suggestions ? beside more rams.
>> >>
>> >>
>> >> 2009/11/16 Jacob Rhoden <jrho...@unimelb.edu.au>:
>> >>>
>> >>> Not sure how large your index is, but it might be easier (if possible to
>> >>> increase your memory) than to develop a fairly complicated alternative
>> >>> strategy.
>> >>>
>> >>> On 16/11/2009, at 2:12 PM, Wenbo Zhao wrote:
>> >>>
>> >>>> Hi, all
>> >>>> I'm facing a large index, on a x86 win platform which may not have big
>> >>>> enough jvm heap space to hold the entire index.
>> >>>> So, I think it's possible to split the index into several smaller
>> >>>> indexes, run them in different jvm instances on different machine.
>> >>>> Then for each query, I can concurrently run it on every index and
>> >>>> merge the result together.
>> >>>> This can be a workaround of OutOfMemory issue.
>> >>>> But before I start to do this, I want to ask if Lucene already have a
>> >>>> solution for things like this.
>> >>>> Thanks.
>> >>>>
>> >>>> --
>> >>>>
>> >>>> Best Regards,
>> >>>> ZHAO, Wenbo
>> >>>>
>> >>>> =======================
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>>>
>> >>>
>> >>> ____________________________________
>> >>> Information Technology Services,
>> >>> The University of Melbourne
>> >>>
>> >>> Email: jrho...@unimelb.edu.au
>> >>> Phone: +61 3 8344 2884
>> >>> Mobile: +61 4 1095 7575

--

Best Regards,
ZHAO, Wenbo

=======================
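
Going back to the idea in the original question (run each query on every index concurrently, then merge the results), a minimal sketch of that scatter-gather step might look like the following. It is plain Java and does not use any Lucene API: `ScatterGather`, `Shard` and `Hit` are hypothetical names for this sketch, and a real `Shard` would wrap a searcher running in another JVM. One caveat to keep in mind is that scores computed against separate indexes are not strictly comparable, because term statistics differ per index, so ordering the merged list by raw score is only an approximation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: query every sub-index in parallel and merge the top hits. */
final class ScatterGather {

    /** Hypothetical hit: source shard name, document id within it, and score. */
    static final class Hit {
        final String shard;
        final int doc;
        final float score;

        Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Hypothetical shard: one sub-index (for example one year of data). */
    interface Shard {
        List<Hit> search(String query, int topN) throws Exception;
    }

    /** Sends the query to all shards at once and returns the best topN hits overall. */
    static List<Hit> search(List<Shard> shards, String query, int topN) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard shard : shards) {
                tasks.add(() -> shard.search(query, topN));
            }
            List<Hit> merged = new ArrayList<>();
            // invokeAll blocks until every shard has answered or failed.
            for (Future<List<Hit>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            // Highest score first, then keep only the global top N.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two fake shards with canned results stand in for real sub-indexes.
        Shard y2008 = (q, n) -> Arrays.asList(new Hit("2008", 7, 0.9f), new Hit("2008", 3, 0.4f));
        Shard y2009 = (q, n) -> Arrays.asList(new Hit("2009", 1, 0.7f));
        for (Hit h : search(Arrays.asList(y2008, y2009), "1234567", 2)) {
            System.out.println(h.shard + " doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

Each shard only has to return its own top N, so the merge cost stays small no matter how large the individual indexes are.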