1. No, I'm not using sort.  Actually I'm just going to start reading that section.
2. No, I did only one search, '1234567', to 'warm up' the searcher, then OOM.
3. After the IndexReader/searcher is created, I force finalization/GC and
print the total memory used, then run one search for '1234567' as a
warmup, followed by another finalize + memory print.
I prepared 3 indexes:
A: from 2009/06 to now, 47G.  449M after the reader is created, then 649M
after the first search.
B: from 2009/04 to now, 63G.  592M -> 869M
C: from 2009/01 to now, 100+G.  598M -> OOM
See, I have done nothing yet.
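Roughly, the measurement I do looks like this (a minimal JDK-only sketch;
`usedMemoryMB` is just a helper name I made up here, not Lucene API, and the
GC/finalize calls only *encourage* collection, they don't guarantee it):

```java
// Sketch of the "finalize + print total mem" measurement described above.
public class MemCheck {
    // Approximate live heap usage in MB after nudging the JVM to collect.
    public static long usedMemoryMB() {
        Runtime rt = Runtime.getRuntime();
        // gc()/runFinalization() are only hints to the JVM, so the reading
        // is approximate, but good enough to compare before/after a search.
        System.gc();
        System.runFinalization();
        System.gc();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("after reader created: " + usedMemoryMB() + "M");
        // ... open the IndexSearcher and run the '1234567' warmup search here ...
        System.out.println("after first search:   " + usedMemoryMB() + "M");
    }
}
```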
And I have data back to the year 2000 to index, while I have only 32-bit
Windows machines...
That's why I want to build a 'distributed index'.
And another reason: most of the searches will likely be wildcard queries,
which are very slow against a single large index.
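The merge step I have in mind (run the same query on every index
concurrently, then combine the top hits by score) is something like the
sketch below.  `Hit`, `mergeTopK`, and `searchAll` are toy stand-ins I made
up for illustration, not Lucene classes; the ParallelMultiSearcher Erick
mentions is essentially this idea already packaged by Lucene.

```java
import java.util.*;
import java.util.concurrent.*;

// Toy sketch of "run the query on every index concurrently and merge the
// results" -- Hit and the shard callables are made-up stand-ins, not Lucene API.
public class ShardMerge {
    public static final class Hit {
        public final String id;
        public final float score;
        public Hit(String id, float score) { this.id = id; this.score = score; }
    }

    // Merge per-shard top hits into one global top-k list, highest score first.
    public static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        PriorityQueue<Hit> pq =
            new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
        for (List<Hit> hits : perShard) pq.addAll(hits);
        List<Hit> out = new ArrayList<>();
        while (!pq.isEmpty() && out.size() < k) out.add(pq.poll());
        return out;
    }

    // Fan the same query out to all shards in parallel, then merge.
    public static List<Hit> searchAll(List<Callable<List<Hit>>> shardSearches,
                                      int k) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(shardSearches.size());
        try {
            List<List<Hit>> perShard = new ArrayList<>();
            for (Future<List<Hit>> f : pool.invokeAll(shardSearches))
                perShard.add(f.get());
            return mergeTopK(perShard, k);
        } finally {
            pool.shutdown();
        }
    }
}
```

One caveat with this kind of merge: raw scores from independent indexes are
not strictly comparable unless term statistics are shared or similar.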

2009/11/16 Erick Erickson <erickerick...@gmail.com>:
> I confess that I've just skimmed your e-mail, but there's absolutely
> no requirement that the entire index fit in RAM. The fact that your
> index is larger than available RAM isn't the reason you're hitting OOM.
>
> Typical reasons for this are:
> 1> you're sorting on a field with many, many, many unique values. If
> you're sorting on a fine-grained timestamp, this is quite possible.
> 2> You've bumped MAX_BOOLEAN_CLAUSES and are searching
> on, say, one-letter wildcards.
> 3> many other reasons.
>
> I agree with Jacob, jumping into a multi-machine solution without
> understanding the problem in detail may not be your best course.
>
> So, can you tell us more about the conditions under which you hit
> OOM? Maybe with more details we can come up with better solutions.
>
> If you absolutely *must* implement a multi-machine solution, have
> you seen ParallelMultiSearcher?
>
> Best
> Erick
>
> On Mon, Nov 16, 2009 at 2:13 AM, Wenbo Zhao <zha...@gmail.com> wrote:
>
>> Yes, exactly 'distributed'...
>> From a maintenance point of view, being 'horizontally' expandable is
>> very important.
>> In my case, the data files are a kind of 'history' file, categorized
>> by date.  Once a data file is indexed, it will not change, unless
>> the searched fields change.
>> Say I index the whole ten years of data, generating a 400G index and
>> requiring 8G of RAM.  When I do a backup, I have to back up the entire
>> 400G every time.  I'd need another 8G machine for backup.  And 8G is
>> not enough, since the index grows every day.
>> Compare that to a distributed solution: I can split the index by year
>> or by quarter.  Say I have 10x40G indexes.  I can easily run 10 JVM
>> processes, each with 1G of heap space, on 3-5 low-cost, non-dedicated
>> x86 machines.
>> As for backup, 9 of the 10 indexes are old and only need to be backed
>> up once; they won't change.  Only 1 hot index changes every day, so I
>> only back up that 40G.  The spare machine is also very cheap.  And
>> since the machines are so cheap, I can run this on VMs, which is more
>> flexible for resource management.  As time goes by, I just install a
>> new JVM instance when needed.  I don't worry about RAM and search
>> speed anymore.
>> I do think there are bigger cases out there just like mine.
>> A general distributed Lucene would be very useful.  It would bring
>> Lucene to more enterprise applications, and to even bigger,
>> industry-scale applications.
>>
>>
>> 2009/11/16 Jacob Rhoden <jrho...@unimelb.edu.au>:
>> > Sounds like you may need to have some sort of distributed system. I
>> > just wanted to make sure you were aware of the costs/benefits of simply
>> > buying a big 64-bit/8GB RAM machine, versus having to not only maintain
>> > and power several 32-bit machines, but also maintain and support your
>> > now more complicated code.
>> >
>> > I have seen it too many times: developers/companies spend so much
>> > money not just on the initial development, but on long-term support
>> > and maintenance that could have been avoided by just buying a bigger,
>> > more powerful machine in the first place.
>> >
>> > I am interested to see what other people have to say about how to
>> > solve your problem.
>> >
>> > Best regards,
>> > Jacob
>> >
>> > On 16/11/2009, at 3:39 PM, Wenbo Zhao wrote:
>> >
>> >> My data is categorized by date.  About 14M+ docs per month, 37M+ terms.
>> >> When I use a 1G heap to search a 10-month index, I get OOM.
>> >> The problem is that I can't increase the heap size in an easy way.
>> >> I have several machines, all 32-bit Windows, 4G of RAM.
>> >> And my goal is to index 10 years' data, plus more data every day!
>> >> If I put all of it together, I will need 8G+ of RAM to run searches,
>> >> and maybe another 8G+ of RAM to run the IndexWriter.
>> >>
>> >> I think splitting the large index into smaller indexes and using a
>> >> group of machines working as one is more flexible and faster compared
>> >> to one huge-RAM machine.
>> >> Any suggestions?  Besides more RAM.
>> >>
>> >>
>> >> 2009/11/16 Jacob Rhoden <jrho...@unimelb.edu.au>:
>> >>>
>> >>> Not sure how large your index is, but it might be easier to increase
>> >>> your memory (if possible) than to develop a fairly complicated
>> >>> alternative strategy.
>> >>>
>> >>> On 16/11/2009, at 2:12 PM, Wenbo Zhao wrote:
>> >>>
>> >>>> Hi, all
>> >>>> I'm facing a large index, on an x86 Windows platform which may not
>> >>>> have a big enough JVM heap to hold the entire index.
>> >>>> So I think it's possible to split the index into several smaller
>> >>>> indexes and run them in different JVM instances on different machines.
>> >>>> Then for each query, I can run it concurrently on every index and
>> >>>> merge the results together.
>> >>>> This could be a workaround for the OutOfMemory issue.
>> >>>> But before I start to do this, I want to ask whether Lucene already
>> >>>> has a solution for things like this.
>> >>>> Thanks.
>> >>>>
>> >>>> --
>> >>>>
>> >>>> Best Regards,
>> >>>> ZHAO, Wenbo
>> >>>>
>> >>>> =======================
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>>>
>> >>>
>> >>> ____________________________________
>> >>> Information Technology Services,
>> >>> The University of Melbourne
>> >>>
>> >>> Email: jrho...@unimelb.edu.au
>> >>> Phone: +61 3 8344 2884
>> >>> Mobile: +61 4 1095 7575
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>>
>



-- 

Best Regards,
ZHAO, Wenbo

=======================
