800 million docs is on the high side for modern hardware.

If even one field has norms on, you're talking almost 800 MB right there: norms cost one byte per document per field, so 800 million docs is nearly 800 MB per normed field. And then if another Searcher is brought up while the old one is still serving (which happens when you update)? Doubled.

Your best bet is to distribute across a couple machines.
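With Solr's distributed search that's just the shards parameter on the
query (hostnames here are made up):

  http://box1:8983/solr/select?shards=box1:8983/solr,box2:8983/solr&q=*:*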

To minimize memory you would want to turn caching off or down, don't facet, don't sort, turn off all norms, and possibly get at the Lucene term index interval and raise it. Drop the on-deck searchers setting. Even then, 800 million... time to distribute, I'd think.
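For the norms and term-index pieces, the Lucene-level knobs look roughly
like this (a sketch against the Lucene 2.4 API - in Solr you'd set
omitNorms in schema.xml instead, and I believe the interval is exposed
as termIndexInterval in solrconfig.xml):

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class LeanIndex {
      public static void main(String[] args) throws Exception {
          IndexWriter writer = new IndexWriter(
                  FSDirectory.getDirectory("/tmp/leanindex"),
                  new WhitespaceAnalyzer(), true,
                  IndexWriter.MaxFieldLength.UNLIMITED);

          // Keep every 1024th term in the in-memory term index (.tii)
          // instead of the default every 128th - roughly 8x less
          // term-index RAM, at the cost of slower term lookups.
          writer.setTermIndexInterval(1024);

          Document doc = new Document();
          Field f = new Field("bcid", "someValue",
                  Field.Store.YES, Field.Index.NOT_ANALYZED);
          f.setOmitNorms(true); // skip the 1-byte-per-doc norms array
          doc.add(f);
          writer.addDocument(doc);
          writer.close();
      }
  }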

vivek sar wrote:
An update on this issue:

1) I attached jconsole to my app and monitored the memory usage.
During indexing the memory usage goes up and down, which I think is
normal. The memory stays around the min heap size (4G) during
indexing, but as soon as I run a search the tenured heap usage jumps
up to 6G and stays there. Subsequent searches increase the heap
usage even more, until it reaches the max (8G) - after which
everything (indexing and searching) becomes slow.

The search query in this case is a very generic one that goes through
all the cores (4 of them - 800 million records), finds 400 million
matches, and returns 100 rows.

Does the Solr searcher hold on to references to objects in memory? I
couldn't find any setting suggesting it does, but every search causing
the heap to grow is definitely suspicious.

2) I ran a jmap histogram to get the top objects (this is on a smaller
instance with 2G memory, and before running a search - after running a
search I wasn't able to run jmap at all).
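For reference, the command was the stock JDK one, along these lines
(the pid is illustrative):

   jmap -histo 12345

and the top of the output was: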

 num     #instances         #bytes  class name
----------------------------------------------
   1:       3890855      222608992  [C
   2:       3891673      155666920  java.lang.String
   3:       3284341      131373640  org.apache.lucene.index.TermInfo
   4:       3334198      106694336  org.apache.lucene.index.Term
   5:           271       26286496  [J
   6:            16       26273936  [Lorg.apache.lucene.index.Term;
   7:            16       26273936  [Lorg.apache.lucene.index.TermInfo;
   8:        320512       15384576  org.apache.lucene.index.FreqProxTermsWriter$PostingList
   9:         10335       11554136  [I

I'm not sure what the first one ([C) is. I couldn't profile it to see
what all the Strings are being allocated by - any ideas?

Any ideas on what the Searcher might be holding on to, and how we can
change that behavior?

Thanks,
-vivek


On Thu, May 14, 2009 at 11:33 AM, vivek sar <vivex...@gmail.com> wrote:
I don't know if field type has any impact on the memory usage - does it?

Our use cases require complete matches, so there is no need for any
analysis in most cases - does that matter in terms of memory usage?

Also, is there any default caching used by Solr if I comment out all
the caches under query in solrconfig.xml? I also don't have any
auto-warming queries.

Thanks,
-vivek

On Wed, May 13, 2009 at 4:24 PM, Erick Erickson <erickerick...@gmail.com> wrote:
Warning: I'm waaaay out of my competency range when I comment
on SOLR, but I've seen the statement that string fields are NOT
tokenized while text fields are, and I notice that almost all of your fields
are string type.

Would someone more knowledgeable than me care to comment on whether
this is at all relevant? Offered in the spirit that sometimes there are
things
so basic that only an amateur can see them <G>....
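If I have that right, in schema.xml terms the difference is roughly
this (field names invented):

  <field name="exactCode" type="string" indexed="true" stored="true"/>
  <field name="bodyText" type="text" indexed="true" stored="true"/>

where the string field's whole value becomes a single term, while the
text field is run through an analyzer and split into tokens.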

Best
Erick

On Wed, May 13, 2009 at 4:42 PM, vivek sar <vivex...@gmail.com> wrote:

Thanks Otis.

Our use case doesn't require any sorting or faceting. I'm wondering if
I've configured anything wrong.

I've got a total of 25 fields (15 are indexed and stored, the other 10
are just stored). All my fields are basic data types - which I thought
are not sorted. My id field is the unique key.

Is there any field here that might be getting sorted?

 <field name="id" type="long" indexed="true" stored="true"
required="true" omitNorms="true" compressed="false"/>

  <field name="atmps" type="integer" indexed="false" stored="true"
compressed="false"/>
  <field name="bcid" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="cmpcd" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="ctry" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="dlt" type="date" indexed="false" stored="true"
default="NOW/HOUR"  compressed="false"/>
  <field name="dmn" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="eaddr" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="emsg" type="string" indexed="false" stored="true"
compressed="false"/>
  <field name="erc" type="string" indexed="false" stored="true"
compressed="false"/>
  <field name="evt" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="from" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="lfid" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="lsid" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="prsid" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="rc" type="string" indexed="false" stored="true"
compressed="false"/>
  <field name="rmcd" type="string" indexed="false" stored="true"
compressed="false"/>
  <field name="rmscd" type="string" indexed="false" stored="true"
compressed="false"/>
  <field name="scd" type="string" indexed="true" stored="true"
omitNorms="true" compressed="false"/>
  <field name="sip" type="string" indexed="false" stored="true"
compressed="false"/>
  <field name="ts" type="date" indexed="true" stored="false"
default="NOW/HOUR" omitNorms="true"/>


  <!-- catchall field, containing all other searchable text fields
(implemented via copyField further on in this schema) -->
  <field name="all" type="text_ws" indexed="true" stored="false"
omitNorms="true" multiValued="true"/>

Thanks,
-vivek

On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
Hi,
Some answers:
1) The .tii files in the Lucene index. When you sort, all distinct
values for the field(s) used for sorting get loaded into memory.
Similarly for facet fields. Plus Solr's caches.
2) ramBufferSizeMB dictates, more or less, how much memory Lucene/Solr
will consume during indexing. There is no need to commit every 50K docs
unless you want to trigger snapshot creation.
3) see 1) above
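To make 1) concrete: sorting on a string field populates the Lucene
FieldCache with an entry per document, roughly like this under the hood
(a sketch against the Lucene 2.4-era API; index path and field name are
made up):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;

  public class FieldCacheDemo {
      public static void main(String[] args) throws Exception {
          IndexReader reader = IndexReader.open("/path/to/index");
          // One String reference per doc, plus every distinct value,
          // pinned in memory for the life of the reader.
          String[] values = FieldCache.DEFAULT.getStrings(reader, "bcid");
          System.out.println(values.length + " cached entries");
          reader.close();
      }
  }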

1.5 billion docs per instance, where each doc is circa 1KB? I doubt
that's going to fly. :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: vivek sar <vivex...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 13, 2009 3:04:46 PM
Subject: Solr memory requirements?

Hi,

  I'm pretty sure this has been asked before, but I couldn't find a
complete answer in the forum archive. Here are my questions,

1) When Solr starts up, what does it load into memory? Let's say
I've got 4 cores, each 50G in size. When Solr comes up, how much
of that would be loaded into memory?

2) How much memory is required during index time? If I'm committing
50K records at a time (1 record = 1KB) using SolrJ, how much memory do
I need to give Solr?
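For context, the indexing loop is essentially the following SolrJ
sketch - the URL and field values are made up:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
      public static void main(String[] args) throws Exception {
          SolrServer server =
                  new CommonsHttpSolrServer("http://localhost:8983/solr");
          List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
          for (long i = 0; i < 50000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", i);
              doc.addField("bcid", "value-" + i);
              batch.add(doc);
          }
          server.add(batch);  // buffered; RAM use governed by ramBufferSizeMB
          server.commit();    // flushes and opens a new searcher
      }
  }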

3) Is there a minimum amount of memory Solr needs to maintain an index
of a certain size? Is there any benchmark on this?

Here are some of my configuration from solrconfig.xml,

1) <ramBufferSizeMB>64</ramBufferSizeMB>
2) All the caches (under the query tag) are commented out
3) A few others:
      a) true    ==> would this require memory?
      b) 50
      c) 200
      d)
      e) false
      f) 2

The problem we are having is the following:

I've given Solr 6G of RAM. As the total index size (all cores
combined) grows, Solr's memory consumption goes up. With 800
million documents, I see Solr already taking up all the memory at
startup. After that, commits and searches all become slow. We
will have a distributed setup with multiple Solr instances (around
8) on four boxes, but our requirement is for each Solr instance to
maintain at least around 1.5 billion documents.

We are trying to see if we can somehow reduce the Solr memory
footprint. If someone can provide a pointer on which parameters affect
memory and what effect each has, we can then decide whether we want a
given parameter or not. I'm not sure if there is a minimum amount of
memory Solr requires to be able to maintain large indexes. I've used
Lucene before and it didn't require anything by default - it used up
memory only during index and search times - not otherwise.

Any help is very much appreciated.

Thanks,
-vivek


--
- Mark

http://www.lucidimagination.com


