RE: Out of memory on Solr sorting

2008-08-05 Thread sundar shankar
Hi all,
I seem to have found the solution to this problem. Apparently, 
allocating enough virtual memory on the server only solves half of the 
problem. Even after allocating 4 gigs of virtual memory on the JBoss server, I 
still got the out-of-memory error on sorting. 
 
I hadn't, however, noticed that the LRU cache in my config was set to the default, 
which was still 512 megs of max memory. I had to increase that to around 2 
gigs and the sorting did work perfectly OK.
 
Even though I am satisfied that I have found the solution to the problem, I am 
still not happy that sorting consumes so much memory. In no other product 
have I seen sorting on 10 fields take up a gig and a half of virtual memory. I am 
not sure if there could be a better implementation of this, but something 
doesn't seem right to me.
 
Thanks for all your support. It has truly been overwhelming.
 
Sundar




RE: Out of memory on Solr sorting

2008-08-05 Thread Fuad Efendi

Hi Sundar,


If increasing the LRU cache helps you:
- you are probably using a 'tokenized' field for sorting (could you  
confirm, please?)...

...you should use a 'non-tokenized, single-valued, non-boolean' field for  
better sorting performance...


Fuad Efendi
==
http://www.tokenizer.org










RE: Out of memory on Solr sorting

2008-08-05 Thread sundar shankar



The field is of type text_ws. Is this not recommended? Should I use text 
instead?


RE: Out of memory on Solr sorting

2008-08-05 Thread Fuad Efendi
My understanding of Lucene sorting is that it sorts by 'tokens'  
and not by 'full fields'... so for sorting you need a 'full-string'  
(non-tokenized) field, and for searching you need another, tokenized one.


For instance, use 'string' for sorting and 'text_ws' for searching, and  
use 'copyField'... (some memory is needed for the copyField)
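For illustration, a minimal schema.xml sketch along those lines (the field  
names 'title' and 'title_sort' here are made up; adapt them to your schema):

<!-- tokenized field used for searching -->
<field name="title" type="text_ws" indexed="true" stored="true"/>

<!-- untokenized copy used only for sorting -->
<field name="title_sort" type="string" indexed="true" stored="false"/>

<!-- populate the sort field automatically at index time -->
<copyField source="title" dest="title_sort"/>

You would then query against 'title' but pass sort=title_sort asc.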


Sorting on a tokenized field: 100,000 documents, each 'Book Title'  
consisting of 10 tokens on average, gives a total of 1,000,000 (probably  
unique) tokens in a hashtable; with a non-tokenized field there are only  
100,000 entries, and the Lucene internal FieldCache is used instead of the SOLR LRU cache.



Also, with tokenized fields the sort order is not the natural (alphabetical) order...


Fuad Efendi
==
http://www.linkedin.com/in/liferay






RE: Out of memory on Solr sorting

2008-08-05 Thread Fuad Efendi

Best choice for sorting field:
<!-- This is an example of using the KeywordTokenizer along
     With various TokenFilterFactories to produce a sortable field
     that does not include some properties of the source text
  -->
<fieldType name="alphaOnlySort" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
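(The analyzer body is not shown above. In the example schema.xml shipped with  
Solr the type is defined roughly like this - a sketch, so check your own copy:)

<fieldType name="alphaOnlySort" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercase so the sort is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop leading/trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

(The stock definition also strips non-alphanumeric characters with a  
solr.PatternReplaceFilterFactory; whether you want that depends on how your  
values should collate.)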


- case-insensitive, etc...


I might be partially wrong about the SOLR LRU cache, but it is used somehow  
in your specific case... 'filterCache' is probably used for  
'tokenized' sorting: it stores (token, DocList) pairs...



Fuad Efendi
==
http://www.tokenizer.org








Re: Out of memory on Solr sorting

2008-08-05 Thread Yonik Seeley
On Tue, Aug 5, 2008 at 1:59 PM, Fuad Efendi [EMAIL PROTECTED] wrote:
 If increasing LRU cache helps you:
 - you are probably using 'tokenized' field for sorting (could you confirm
 please?)...

Sorting does not utilize any Solr caches.

-Yonik


Re: Out of memory on Solr sorting

2008-08-05 Thread Fuad Efendi
I know, and this is strange... I was guessing that filterCache is used  
implicitly to get the DocSet for each token; as Sundar wrote, increasing the  
LRUCache helped him (he is sorting on a 'text_ws' field).

-Fuad

If increasing LRU cache helps you:
- you are probably using 'tokenized' field for sorting (could you confirm
please?)...


Sorting does not utilize any Solr caches.

-Yonik







RE: Out of memory on Solr sorting

2008-08-05 Thread sundar shankar
Yes, this is what I did. I got an out-of-memory error while executing a query with a 
sort param.
 
1. Stopped the JBoss server.
 
2. Changed the cache sizes in solrconfig.xml:

<filterCache
  class="solr.LRUCache"
  size="2048"
  initialSize="512"
  autowarmCount="256"/>

<!-- queryResultCache caches results of searches - ordered lists of
     document ids (DocList) based on a query, a sort, and the range of
     documents requested. -->
<queryResultCache
  class="solr.LRUCache"
  size="2048"
  initialSize="512"
  autowarmCount="256"/>

<!-- documentCache caches Lucene Document objects (the stored fields for each
     document). Since Lucene internal document ids are transient, this cache
     will not be autowarmed. -->
<documentCache
  class="solr.LRUCache"
  size="2048"
  initialSize="512"
  autowarmCount="0"/>

In these 3 params, I changed size from 512 to 2048.

3. Restarted the server.

4. Ran the query again.

It worked just fine after that. I am currently reindexing, replacing the 
text_ws type with string and setting all 3 caches back to the default size of 512, 
to see if the problem goes away.
 
-Sundar




RE: Out of memory on Solr sorting

2008-08-05 Thread Fuad Efendi
Sundar, it is very strange that increasing the size/initialSize of the LRUCache  
helps with the OutOfMemoryError...


2048 is the number of entries in the cache and _not_ 2 GB of memory...

Making size == initialSize of the HashMap-based LRUCache would help with  
performance anyway; maybe with OOMs too (probably no need to resize the  
HashMap...).





In these 3 params, I changed size from 512 to 2048. 3. Restarted the server










RE: Out of memory on Solr sorting

2008-08-05 Thread sundar shankar
Oh wow, I didn't know that was the case. I am completely baffled now. Back 
to square one, I guess. :)


RE: Out of memory on Solr sorting

2008-07-29 Thread Lance Norskog
A sneaky source of OutOfMemory errors is the permanent generation.  If you
add this:
-XX:PermSize=64m -XX:MaxPermSize=96m
You will increase the size of the permanent generation. We found this
helped.

Also note that when you undeploy a war file, the old deployment has
permanent storage that is not reclaimed, and so each undeploy/redeploy cycle
eats up the permanent generation pool.

-Original Message-
From: david w [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 29, 2008 7:20 AM
To: solr-user@lucene.apache.org
Subject: Re: Out of memory on Solr sorting

Hi, Daniel

  I got the same problem as Sundar. Is it possible for you to tell me what
profiling tool you are using?

  Thanks a lot.

/David

On Tue, Jul 29, 2008 at 8:19 PM, Daniel Alheiros
[EMAIL PROTECTED] wrote:

 Hi Sundar.

 Well, it would be good if you could do some profiling on your Solr app.
 I've done it during the indexing process so I could figure out what
 was going on with the OutOfMemoryErrors I was getting.

 But you definitely won't need as much memory as your whole
 index size. I have 3.5 million documents (approx. 10 GB) running on
 this 2 GB heap VM.

 Cheers,
 Daniel


RE: Out of memory on Solr sorting

2008-07-23 Thread Daniel Alheiros
Hi

I haven't read the whole thread so I will take my chances here.

I've been fighting recently to keep my Solr instances stable because
they were frequently crashing with OutOfMemoryErrors. I'm using Solr 1.2,
and when that happens there is a bug that leaves the index locked unless
you restart Solr... so in my scenario it was extremely damaging.

After some profiling I realized that my major problem was caused by the
way the JVM heap was being used, as I hadn't configured it with any
advanced options (I had just made it bigger - Xmx and Xms 1.5 GB). It's
running on Sun JVM 1.5 (the most recent 1.5 available) and it's deployed
on JBoss 4.2 on RHEL.

So my findings were that too many objects were being allocated in the old
generation area of the heap, which makes them harder to dispose of, and
that the default behaviour was letting the heap get too full before
kicking off a GC. According to the JVM specs, the default is that if,
shortly after a full GC, a certain percentage of the heap has not been
freed, an OutOfMemoryError should be thrown.

I've changed my JVM startup params and it's been working extremely stably
since then:

-Xmx2048m -Xms2048m -XX:MinHeapFreeRatio=50 -XX:NewSize=1024m
-XX:NewRatio=2 -Dsun.rmi.dgc.client.gcInterval=360
-Dsun.rmi.dgc.server.gcInterval=360

I hope it helps.

Regards,
Daniel Alheiros

-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: 22 July 2008 23:23
To: solr-user@lucene.apache.org
Subject: RE: Out of memory on Solr sorting

Yes, it is a cache; it stores an array of document IDs sorted by the sort
field, together with the sorted field values; query results can intersect
with it and be reordered accordingly.

But the memory requirements should be well documented.

Internally it uses a WeakHashMap, which is not good(!!!) - a lot of
under-the-hood cache warm-ups which SOLR is not aware of...  
Could be.

I think the Lucene-SOLR developers should join this discussion:


/**
  * Expert: The default cache implementation, storing all values in
memory.
  * A WeakHashMap is used for storage.
  *
..

   // inherit javadocs
   public StringIndex getStringIndex(IndexReader reader, String field)
   throws IOException {
 return (StringIndex) stringsIndexCache.get(reader, field);
   }

   Cache stringsIndexCache = new Cache() {

 protected Object createValue(IndexReader reader, Object fieldKey)
 throws IOException {
   String field = ((String) fieldKey).intern();
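      // (the two arrays below are the bulk of the per-field cost of sorting:
      //  retArray holds one int per document - the index of its term in
      //  mterms - and mterms is allocated with maxDoc()+1 String slots, one
      //  filled per unique term value)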
   final int[] retArray = new int[reader.maxDoc()];
   String[] mterms = new String[reader.maxDoc()+1];
   TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms (new Term (field, ""));






Quoting Fuad Efendi [EMAIL PROTECTED]:

 I am hoping [new StringIndex (retArray, mterms)] is called only once 
 per sort field and cached somewhere in Lucene;

 theoretically you need to multiply the number of documents by the size of the 
 field (supposing that the field contains unique text); you need not tokenize 
 this field; you need not store a TermVector.

 for 2,000,000 documents with a simple untokenized text field such as the 
 title of a book (256 bytes) you probably need 512,000,000 bytes per 
 Searcher, and as Mark mentioned you should limit the number of searchers 
 in SOLR.

 So Xmx512M is definitely not enough even for simple cases...


 Quoting sundar shankar [EMAIL PROTECTED]:

 I haven't seen the source code before, but I don't know why the
 sorting isn't done after the fetch. Wouldn't that make it
 faster, at least in the case of field-level sorting? I could be
 wrong and the implementation is probably better than I think, but I
 don't know why all of the field values have to be loaded.





 Date: Tue, 22 Jul 2008 14:26:26 -0700
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Re: Out of memory on Solr sorting

 Ok, after some analysis of FieldCacheImpl:

 - it is supposed that the (sorted) Enumeration of terms is smaller than the
 total number of documents (so SOLR uses a specific field type for sorted
 searches: solr.StrField with omitNorms="true"). It creates an
 int[reader.maxDoc()] array, walks the (sorted) Enumeration of terms
 (untokenized solr.StrField), and populates the array with document IDs.

 - it also creates an array of String: String[] mterms = new
 String[reader.maxDoc()+1]; Why do we need that? For 1G documents with an
 average term/StrField size of 100 bytes (which could be unique text!!!) it
 will create a kind of huge 100Gb cache which is not really needed...
 StringIndex value = new StringIndex(retArray, mterms); If I understand
 correctly... StringIndex _must_ be a file in a filesystem for such a case...
 We create a StringIndex and retrieve the top 10 documents - huge overhead.

 Quoting Fuad Efendi [EMAIL PROTECTED]: Ok, what is confusing me is the
 implicit guess that FieldCache contains the field and Lucene uses in-memory
 sort

RE: Out of memory on Solr sorting

2008-07-23 Thread sundar shankar

Hi Daniel,
 I am afraid that didn't solve my problem. My guess was that I have 
too much data and too little memory allocated for it. I happened to read a 
couple of posts which mentioned that I need a heap that is close to the size of 
my data folder. I have about 540 megs now and a little more than a million and 
a half docs, so ideally 512 megs should be enough for me. In fact I am able to 
perform all other operations - commit, optimize, select, update, nightly cron 
jobs to index data again, etc. - with no hassles. Even my load tests perform 
very well. Just the sort doesn't seem to work. I have allocated 2 gigs of memory 
now. Still the same results. I used the GC params you gave me too. No change 
whatsoever. I'm not sure what's going on. Is there something I can do to find 
out how much is actually needed, as my production server might need to be 
configured accordingly?

I don't store any documents. We basically fetch standard column data from an 
Oracle database and store it in Solr fields. Before I had EdgeNGram configured, 
on Solr 1.2, my data size was less than half of what it is right now; if I 
remember right, it was on the order of 100 megs. The max size of a field right 
now probably doesn't exceed 100 chars either. Puzzled even more now. 

-Sundar

P.S: My configurations : 
Solr 1.3 
Red hat 
540 megs of data (1855013 docs)
2 gigs of memory installed and allocated like this
JAVA_OPTS=$JAVA_OPTS -Xms2048m -Xmx2048m -XX:MinHeapFreeRatio=50 
-XX:NewSize=1024m -XX:NewRatio=2 -Dsun.rmi.dgc.client.gcInterval=360 
-Dsun.rmi.dgc.server.gcInterval=360

Jboss 4.05


RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar



 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Out of memory on Solr sorting
 Date: Tue, 22 Jul 2008 19:11:02 +
 
 
 Hi,
 Sorry again, fellows. I am not sure what's happening. The day with Solr is bad 
 for me I guess. EZMLM didn't let me send any mails this morning; it asked me to 
 confirm my subscription and when I did, it said I was already a member. Now my 
 mails are all coming out bad. Sorry for troubling y'all this badly. I hope this 
 mail comes out right.


Hi,
We are developing a product in an agile manner, and the current 
implementation has data of just about 800 megs in dev. 
The memory allocated to Solr on dev (a dual-core Linux box) is 128-512 MB.

My config
=

   <!-- autocommit pending docs if certain criteria are met
   <autoCommit>
     <maxDocs>1</maxDocs>
     <maxTime>1000</maxTime>
   </autoCommit>
   -->

   <filterCache
     class="solr.LRUCache"
     size="512"
     initialSize="512"
     autowarmCount="256"/>

   <queryResultCache
     class="solr.LRUCache"
     size="512"
     initialSize="512"
     autowarmCount="256"/>

   <documentCache
     class="solr.LRUCache"
     size="512"
     initialSize="512"
     autowarmCount="0"/>

   <enableLazyFieldLoading>true</enableLazyFieldLoading>


My Field
===

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory"
            maxGramSize="100" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>


Problem
==

I execute a query that returns 24 rows of results and pick 10 out of it. I have 
no problem when I execute this.
But when I sort it by a String field fetched from this result, I get an OOM. 
I am able to execute several
other queries with no problem; just having a sort asc clause added to the query 
throws an OOM. Why is that,
and what should I ideally have done? My config on QA is pretty similar to the dev 
box and probably has more data than dev. 
It didn't throw any OOM during the integration test. The autocomplete is a new 
field we added recently.

Another point is that the indexing is done with a field of type string:
 <field name="XXX" type="string" indexed="true" stored="true" 
termVectors="true"/>

and the autocomplete field is a copy field.

The sorting is done on the string field.
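(For reference, the kind of request this corresponds to - the host, port and 
query value below are made up; only the sort parameter matters:

http://localhost:8080/solr/select?q=<your query>&sort=XXX+asc&rows=10 )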

Please do let me know what mistake I am making.

Regards
Sundar

P.S: The stack trace of the exception is


Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
query
 at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:86)
 at 
org.apache.solr.client.solrj.impl.BaseSolrServer.query(BaseSolrServer.java:101)
 at 
com.apollo.sisaw.solr.service.AbstractSolrSearchService.makeSolrQuery(AbstractSolrSearchService.java:193)
 ... 105 more
Caused by: org.apache.solr.common.SolrException: Java heap space  
java.lang.OutOfMemoryError: Java heap space 
at 
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403) 
 
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)  
at 
org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352) 
 
at 
org.apache.lucene.search.FieldSortedHitQueue.comparatorString(FieldSortedHitQueue.java:416)
  
at 
org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:207)
  
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)  
at 
org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168)
  
at 
org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:56)
  
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:907)
  
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:838)
  
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:269)  
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:160)
  
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
  
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
  
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1025)  
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) 
 
at 

RE: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi

org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403)

- this piece of code does not request a 100M-element array (as I have seen  
with Lucene); it asks for only a few bytes / KB per field...



Probably 128 - 512 is not enough; it is also advisable to use equal sizes
-Xms1024M -Xmx1024M
(it minimizes GC frequency, and it ensures that 1024M is available at startup).

OOM also happens with fragmented memory, when the application requests a big  
contiguous fragment and the GC is unable to compact; it looks like your  
application requests a little and the memory is not available...




RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar
Thanks Fuad.
  But why does just sorting produce an OOM? I executed the 
query without the sort clause and it executed perfectly. In fact I even tried 
removing the maxrows=10 and executing; it came out fine. Queries with bigger 
results seem to come out fine too. So why does just the sort fail, and that too on just 10 rows?
 
-Sundar




Re: Out of memory on Solr sorting

2008-07-22 Thread Mark Miller
Because to sort efficiently, Solr loads the term to sort on for each doc 
in the index into an array. For ints, longs, etc. it's just an array the 
size of the number of docs in your index (deleted or not, I believe). For 
a String it's an array holding each unique string and an array of ints 
indexing into the String array.


So if you do a sort, and search for something that only gets 1 doc as a 
hit... you're still loading up that field cache for every single doc in 
your index on the first search. With Solr, this happens in the 
background as it warms up the searcher. The end story is, you most likely need 
more RAM to accommodate the sort... have you upped your -Xmx 
setting? I think you can roughly say a 2 million doc index would need 
40-50 MB (rough, but to give an idea) per field you're 
sorting on.


- Mark
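(A rough back-of-the-envelope version of that estimate, assuming about 2 million 
docs and a mostly-unique string sort field with short values - the exact numbers 
depend on your data and JVM:

  ord array:    2,000,000 docs x 4 bytes                        ~  8 MB
  String array: one entry per unique value; ~2,000,000 short,
                mostly-unique strings plus per-object JVM
                overhead typically add a few tens of MB

which lands in the 40-50 MB per sorted field range mentioned above.)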


Re: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi
I've even seen exceptions (posted here) where sort-type queries  
caused Lucene to allocate 100 MB arrays; here is what happened to me:


SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979
at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360)
at  
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)





- it does not happen after I increased from 4096M to 8192M (JRockit  
R27; more intelligent stacktrace, isn't it?)


Thanks Mark; I didn't know that it happens only once (on warming up a  
searcher).





Re: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979



I just noticed - this is exactly the number of documents in the index: 25191979.

(http://www.tokenizer.org/ - you can sort by clicking the headers Id, [Country,  
Site, Price] in the table; experimental)



If the array is allocated ONLY when a new searcher warms up, I am _extremely_  
happy... I have had constant OOMs during the past month (Sun Java 5).
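(If you want that allocation to happen during warm-up rather than on the first 
user query, solrconfig.xml lets you register a warming query that includes the 
sort - a sketch, with a made-up field name 'title_sort':

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">title_sort asc</str>
      <str name="rows">1</str>
    </lst>
  </arr>
</listener>

The same block can be registered for the firstSearcher event so the FieldCache 
is populated before the first request hits the server.)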




Quoting Fuad Efendi [EMAIL PROTECTED]:


I've even seen exceptions (posted here) when sort-type queries caused
Lucene to allocate 100Mb arrays, here is what happened to me:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979
at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)




- it does not happen after I increased from 4096M to 8192M (JRockit
R27; more intelligent stacktrace, isn't it?)

Thanks Mark; I didn't know that it happens only once (on warming up a
searcher).



Quoting Mark Miller [EMAIL PROTECTED]:


Because to sort efficiently, Solr loads the term to sort on for each
doc in the index into an array. For ints,longs, etc its just an array
the size of the number of docs in your index (i believe deleted or
not). For a String its an array to hold each unique string and an array
of ints indexing into the String array.

So if you do a sort, and search for something that only gets 1 doc as a
hit...your still loading up that field cache for every single doc in
your index on the first search. With solr, this happens in the
background as it warms up the searcher. The end story is, you need more
RAM to accommodate the sort most likely...have you upped your xmx
setting? I think you can roughly say a 2 million doc index would need
40-50 MB (depending and rough, but to give an idea) per field your
sorting on.

- Mark

sundar shankar wrote:

Thanks Fuad.
But why does just sorting provide an OOM. I
executed the query without adding the sort clause it executed
perfectly. In fact I even tried remove the maxrows=10 and   
executed.  it came out fine. Queries with bigger results seems to   
come out  fine too. But why just sort of that too just 10 rows??

-Sundar




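To make the layout Mark describes above concrete, here is a minimal, illustrative sketch of the two arrays built for a sorted string field: one ordinal per document pointing into a sorted array of unique terms. The class and names below are invented for illustration; this is not Lucene's actual FieldCacheImpl code.

// Illustrative only: mirrors the layout Mark describes (one ordinal per
// document plus the unique term strings), not Lucene's real FieldCacheImpl.
public class StringSortCacheSketch {

    final int[] ordinals;       // one entry per document (maxDoc), ~4 bytes each
    final String[] uniqueTerms; // one entry per distinct term in the field, sorted

    StringSortCacheSketch(int[] ordinals, String[] uniqueTerms) {
        this.ordinals = ordinals;
        this.uniqueTerms = uniqueTerms;
    }

    // Comparing two docs during a sort only touches the int array,
    // because the unique terms are stored in sorted order.
    int compareDocs(int docA, int docB) {
        return ordinals[docA] - ordinals[docB];
    }

    // The String array is only needed to report the actual field value.
    String termForDoc(int doc) {
        return uniqueTerms[ordinals[doc]];
    }

    public static void main(String[] args) {
        // Tiny 5-document example: ordinals index into the unique term list.
        String[] terms = { "apple", "banana", "cherry" };
        int[] ords = { 2, 0, 1, 0, 2 };   // doc 0 has "cherry", doc 1 "apple", ...
        StringSortCacheSketch cache = new StringSortCacheSketch(ords, terms);
        System.out.println("doc 3 sorts by: " + cache.termForDoc(3));
        System.out.println("doc 0 vs doc 1: " + cache.compareDocs(0, 1));
    }
}

In real Lucene the ordinal array is sized to reader.maxDoc(), which is why even a query that matches a single document pays for the whole index the first time a sort field is used.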

RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar

Thanks for the explanation, Mark. The reason I had it at a 512 max was because earlier
the data file was just about 30 megs, and it grew to this size because of the use of
EdgeNGramFilterFactory on 2 fields. It's great to know it just
happens for the first search. But this exception has been occurring for me for
the whole of today. Should I fiddle around with the warmer settings too? I have
also increased the heap to 1024. Will keep you posted on how it turns
out.

Thanks
-Sundar


 Date: Tue, 22 Jul 2008 15:46:04 -0400
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Re: Out of memory on Solr sorting
 
 Because to sort efficiently, Solr loads the term to sort on for each doc 
 in the index into an array. For ints,longs, etc its just an array the 
 size of the number of docs in your index (i believe deleted or not). For 
 a String its an array to hold each unique string and an array of ints 
 indexing into the String array.
 
 So if you do a sort, and search for something that only gets 1 doc as a 
 hit...your still loading up that field cache for every single doc in 
 your index on the first search. With solr, this happens in the 
 background as it warms up the searcher. The end story is, you need more 
 RAM to accommodate the sort most likely...have you upped your xmx 
 setting? I think you can roughly say a 2 million doc index would need 
 40-50 MB (depending and rough, but to give an idea) per field your 
 sorting on.
 
 - Mark
 
 sundar shankar wrote:
  Thanks Fuad.
But why does just sorting provide an OOM. I executed the 
  query without adding the sort clause it executed perfectly. In fact I even 
  tried remove the maxrows=10 and executed. it came out fine. Queries with 
  bigger results seems to come out fine too. But why just sort of that too 
  just 10 rows??
   
  -Sundar
 
 
 

  Date: Tue, 22 Jul 2008 12:24:35 -0700 From: [EMAIL PROTECTED] To: 
  solr-user@lucene.apache.org Subject: RE: Out of memory on Solr sorting  
  org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403)
- this piece of code do not request Array[100M] (as I seen with  
  Lucene), it asks only few bytes / Kb for a field...   Probably 128 - 
  512 is not enough; it is also advisable to use equal sizes -Xms1024M 
  -Xmx1024M (it minimizes GC frequency, and itensures that 1024M is 
  available at startup)  OOM happens also with fragmented memory, when 
  application requests big  contigues fragment and GC is unable to 
  optimize; looks like your  application requests a little and memory is 
  not available...   Quoting sundar shankar [EMAIL PROTECTED]:
From: [EMAIL PROTECTED]  To: solr-user@lucene.apache.org  
  Subject: Out of memory on Solr sorting  Date: Tue, 22 Jul 2008 19:11:02 
  +Hi,  Sorry again fellos. I am not sure whats 
  happening. The day with   solr is bad for me I guess. EZMLM didnt let 
  me send any mails this   morning. Asked me to confirm subscription and 
  when I did, it said I   was already a member. Now my mails are all 
  coming out bad. Sorry   for troubling y'all this bad. I hope this mail 
  comes out right.Hi,  We are developing a product in a agile 
  manner and the current   implementation has a data of size just about a 
  800 megs in dev.  The memory allocated to solr on dev (Dual core Linux 
  box) is 128-512.   My config  =   !-- autocommit 
  pending docs if certain criteria are met  autoCommit  
  maxDocs1/maxDocs  maxTime1000/maxTime  /autoCommit  
  --   filterCache  class=solr.LRUCache  size=512  
  initialSize=512  autowarmCount=256/   queryResultCache  
  class=solr.LRUCache  size=512  initialSize=512  
  autowarmCount=256/   documentCache  class=solr.LRUCache  
  size=512  initialSize=512  autowarmCount=0/   
  enableLazyFieldLoadingtrue/enableLazyFieldLoadingMy Field  
  ===   fieldType name=autocomplete class=solr.TextField  
  analyzer type=index  tokenizer 
  class=solr.KeywordTokenizerFactory/  filter 
  class=solr.LowerCaseFilterFactory /  filter 
  class=solr.PatternReplaceFilterFactory   pattern=([^a-z0-9]) 
  replacement= replace=all /  filter 
  class=solr.EdgeNGramFilterFactory   maxGramSize=100 minGramSize=1 
  /  /analyzer  analyzer type=query  tokenizer 
  class=solr.KeywordTokenizerFactory/  filter 
  class=solr.LowerCaseFilterFactory /  filter 
  class=solr.PatternReplaceFilterFactory   pattern=([^a-z0-9]) 
  replacement= replace=all /  filter 
  class=solr.PatternReplaceFilterFactory   pattern=^(.{20})(.*)? 
  replacement=$1 replace=all /  /analyzer  /fieldType
  Problem  ==   I execute a query that returns 24 rows of result. 
  I pick 10 out of   it. I have no problem when I execute this.  But 
  When I do sort it by a String field that is fetched from this   result. 
  I get an OOM. I am able to execute several  other queries with no 
  problem. Just having a sort asc clause added   to the query throws an 
  OOM. Why is that.  What

RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar
Sorry, not 30 megs but 300 :)

From: [EMAIL PROTECTED]: [EMAIL PROTECTED]: RE: Out of memory on Solr 
sortingDate: Tue, 22 Jul 2008 20:19:49 +


Thanks for the explanation mark. The reason I had it as 512 max was cos earlier 
the data file was just about 30 megs and it increased to this much for of the 
usage of EdgeNGramFactoryFilter for 2 fields. Thats great to know it just 
happens for the first search. But this exception has been occuring for me for 
the whole of today. Should I fiddle around with the warmer settings too? I have 
also instructed an increase in Heap to 1024. Will keep you posted on the turn 
arounds.Thanks-Sundar Date: Tue, 22 Jul 2008 15:46:04 -0400 From: [EMAIL 
PROTECTED] To: solr-user@lucene.apache.org Subject: Re: Out of memory on Solr 
sorting  Because to sort efficiently, Solr loads the term to sort on for each 
doc  in the index into an array. For ints,longs, etc its just an array the  
size of the number of docs in your index (i believe deleted or not). For  a 
String its an array to hold each unique string and an array of ints  indexing 
into the String array.  So if you do a sort, and search for something that 
only gets 1 doc as a  hit...your still loading up that field cache for every 
single doc in  your index on the first search. With solr, this happens in the 
 background as it warms up the searcher. The end story is, you need more  RAM 
to accommodate the sort most likely...have you upped your xmx  setting? I 
think you can roughly say a 2 million doc index would need  40-50 MB 
(depending and rough, but to give an idea) per field your  sorting on.  - 
Mark  sundar shankar wrote:  Thanks Fuad.  But why does just sorting 
provide an OOM. I executed the query without adding the sort clause it executed 
perfectly. In fact I even tried remove the maxrows=10 and executed. it came out 
fine. Queries with bigger results seems to come out fine too. But why just sort 
of that too just 10 rows??-Sundar   Date: Tue, 22 Jul 
2008 12:24:35 -0700 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org 
Subject: RE: Out of memory on Solr sorting  
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403)
  - this piece of code do not request Array[100M] (as I seen with  Lucene), 
it asks only few bytes / Kb for a field...   Probably 128 - 512 is not 
enough; it is also advisable to use equal sizes -Xms1024M -Xmx1024M (it 
minimizes GC frequency, and itensures that 1024M is available at startup)  
OOM happens also with fragmented memory, when application requests big  
contigues fragment and GC is unable to optimize; looks like your  application 
requests a little and memory is not available...   Quoting sundar shankar 
[EMAIL PROTECTED]:  From: [EMAIL PROTECTED]  To: 
solr-user@lucene.apache.org  Subject: Out of memory on Solr sorting  
Date: Tue, 22 Jul 2008 19:11:02 +Hi,  Sorry again fellos. I 
am not sure whats happening. The day with   solr is bad for me I guess. 
EZMLM didnt let me send any mails this   morning. Asked me to confirm 
subscription and when I did, it said I   was already a member. Now my mails 
are all coming out bad. Sorry   for troubling y'all this bad. I hope this 
mail comes out right.Hi,  We are developing a product in a agile 
manner and the current   implementation has a data of size just about a 800 
megs in dev.  The memory allocated to solr on dev (Dual core Linux box) is 
128-512.   My config  =   !-- autocommit pending docs if 
certain criteria are met  autoCommit  maxDocs1/maxDocs  
maxTime1000/maxTime  /autoCommit  --   filterCache  
class=solr.LRUCache  size=512  initialSize=512  
autowarmCount=256/   queryResultCache  class=solr.LRUCache  
size=512  initialSize=512  autowarmCount=256/   documentCache 
 class=solr.LRUCache  size=512  initialSize=512  
autowarmCount=0/   
enableLazyFieldLoadingtrue/enableLazyFieldLoadingMy Field  
===   fieldType name=autocomplete class=solr.TextField  
analyzer type=index  tokenizer class=solr.KeywordTokenizerFactory/  
filter class=solr.LowerCaseFilterFactory /  filter 
class=solr.PatternReplaceFilterFactory   pattern=([^a-z0-9]) 
replacement= replace=all /  filter class=solr.EdgeNGramFilterFactory 
  maxGramSize=100 minGramSize=1 /  /analyzer  analyzer 
type=query  tokenizer class=solr.KeywordTokenizerFactory/  filter 
class=solr.LowerCaseFilterFactory /  filter 
class=solr.PatternReplaceFilterFactory   pattern=([^a-z0-9]) 
replacement= replace=all /  filter 
class=solr.PatternReplaceFilterFactory   pattern=^(.{20})(.*)? 
replacement=$1 replace=all /  /analyzer  /fieldType
Problem  ==   I execute a query that returns 24 rows of result. I 
pick 10 out of   it. I have no problem when I execute this.  But When I do 
sort it by a String field that is fetched from this   result. I get an OOM. I 
am able to execute several  other queries with no problem. Just having a sort 
asc clause added   to the query throws an OOM. Why is that.  What should I

Re: Out of memory on Solr sorting

2008-07-22 Thread Mark Miller

Fuad Efendi wrote:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979



I just noticed, this is an exact number of documents in index: 25191979

(http://www.tokenizer.org/, you can sort - click headers Id, [COuntry, 
Site, Price] in a table; experimental)



If array is allocated ONLY on new searcher warming up I am _extremely_ 
happy... I had constant OOMs during past month (SUN Java 5).
It is only on warmup - I believe it's lazily loaded, so the first time a
search is done (Solr does a search as part of warmup, I believe) the
field cache is loaded. The underlying IndexReader is the key to the
field cache, so until you get a new IndexReader (SolrSearcher in Solr
world?) the field cache will stay valid. Keep in mind that as a searcher is
warming, the other searcher is still serving, so that will up the RAM
requirements...and since I think you can have >1 searchers on deck...you
get the idea.

As far as the number I gave, that's from a memory made months and months
ago, so go with what you see.




Quoting Fuad Efendi [EMAIL PROTECTED]:


I've even seen exceptions (posted here) when sort-type queries caused
Lucene to allocate 100Mb arrays, here is what happened to me:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979
at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360) 


at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) 






- it does not happen after I increased from 4096M to 8192M (JRockit
R27; more intelligent stacktrace, isn't it?)

Thanks Mark; I didn't know that it happens only once (on warming up a
searcher).



Quoting Mark Miller [EMAIL PROTECTED]:


Because to sort efficiently, Solr loads the term to sort on for each
doc in the index into an array. For ints,longs, etc its just an array
the size of the number of docs in your index (i believe deleted or
not). For a String its an array to hold each unique string and an array
of ints indexing into the String array.

So if you do a sort, and search for something that only gets 1 doc as a
hit...your still loading up that field cache for every single doc in
your index on the first search. With solr, this happens in the
background as it warms up the searcher. The end story is, you need more
RAM to accommodate the sort most likely...have you upped your xmx
setting? I think you can roughly say a 2 million doc index would need
40-50 MB (depending and rough, but to give an idea) per field your
sorting on.

- Mark

sundar shankar wrote:

Thanks Fuad.
But why does just sorting provide an OOM. I   
executed the query without adding the sort clause it executed   
perfectly. In fact I even tried remove the maxrows=10 and  
executed.  it came out fine. Queries with bigger results seems to  
come out  fine too. But why just sort of that too just 10 rows??

-Sundar




Date: Tue, 22 Jul 2008 12:24:35 -0700 From: [EMAIL PROTECTED] To:  
 solr-user@lucene.apache.org Subject: RE: Out of memory on Solr  
 sorting
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403) 
 - this piece of code do not request Array[100M] (as I seen with 
 Lucene), it asks only few bytes / Kb for a field...   
Probably 128 - 512 is not enough; it is also advisable to use 
equal sizes -Xms1024M -Xmx1024M (it minimizes GC frequency, and 
itensures that 1024M is available at startup)  OOM happens also 
with fragmented memory, when application requests big  contigues 
fragment and GC is unable to optimize; looks like your  
application requests a little and memory is not available...   
Quoting sundar shankar [EMAIL PROTECTED]:  
From: [EMAIL PROTECTED]  To: 
solr-user@lucene.apache.org  Subject: Out of memory on Solr 
sorting  Date: Tue, 22 Jul 2008 19:11:02 +Hi, 
 Sorry again fellos. I am not sure whats happening. The day with 
  solr is bad for me I guess. EZMLM didnt let me send any mails 
this   morning. Asked me to confirm subscription and when I 
did, it said I   was already a member. Now my mails are all 
coming out bad. Sorry   for troubling y'all this bad. I hope 
this mail comes out right.Hi,  We are developing a 
product in a agile manner and the current   implementation has a 
data of size just about a 800 megs in dev.  The memory allocated 
to solr on dev (Dual core Linux box) is 128-512.   My config 
 =   !-- autocommit pending docs if certain criteria 
are met  autoCommit  maxDocs1/maxDocs  
maxTime1000/maxTime  /autoCommit  --   
filterCache  class=solr.LRUCache  size=512  
initialSize=512  autowarmCount=256/   
queryResultCache  class=solr.LRUCache  size=512  
initialSize=512  autowarmCount=256/   documentCache  
class=solr.LRUCache  size=512  initialSize=512  
autowarmCount=0/   
enableLazyFieldLoadingtrue/enableLazyFieldLoadingMy 
Field  ===   fieldType name=autocomplete 
class=solr.TextField  analyzer

Re: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi

Mark,

Question: how much memory do I need for 25,000,000 docs if I do a sort by a
string field of 256 bytes? 6.4Gb?



Quoting Mark Miller [EMAIL PROTECTED]:


Because to sort efficiently, Solr loads the term to sort on for each
doc in the index into an array. For ints,longs, etc its just an array
the size of the number of docs in your index (i believe deleted or
not). For a String its an array to hold each unique string and an array
of ints indexing into the String array.

So if you do a sort, and search for something that only gets 1 doc as a
hit...your still loading up that field cache for every single doc in
your index on the first search. With solr, this happens in the
background as it warms up the searcher. The end story is, you need more
RAM to accommodate the sort most likely...have you upped your xmx
setting? I think you can roughly say a 2 million doc index would need
40-50 MB (depending and rough, but to give an idea) per field your
sorting on.

- Mark







Re: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi

Thank you very much Mark,

it explains a lot to me;

I am guessing: for 1,000,000 documents with a [string] field of
average size 1024 bytes I need 1Gb for a single IndexSearcher instance;
the field-level cache is used internally by Lucene (can Lucene manage
its size?); we can't have 1G of such documents without having 1Tb of
RAM...




Quoting Mark Miller [EMAIL PROTECTED]:


Fuad Efendi wrote:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979



I just noticed, this is an exact number of documents in index: 25191979

(http://www.tokenizer.org/, you can sort - click headers Id,   
[COuntry, Site, Price] in a table; experimental)



If array is allocated ONLY on new searcher warming up I am   
_extremely_ happy... I had constant OOMs during past month (SUN   
Java 5).

It is only on warmup - I believe its lazy loaded, so the first time a
search is done (solr does the search as part of warmup I believe) the
fieldcache is loaded. The underlying IndexReader is the key to the
fieldcache, so until you get a new IndexReader (SolrSearcher in solr
world?) the field cache will be good. Keep in mind that as a searcher
is warming, the other search is still serving, so that will up the ram
requirements...and since I think you can have 1 searchers on
deck...you get the idea.

As far as the number I gave, thats from a memory made months and months
ago, so go with what you see.




Quoting Fuad Efendi [EMAIL PROTECTED]:


I've even seen exceptions (posted here) when sort-type queries caused
Lucene to allocate 100Mb arrays, here is what happened to me:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979
   at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360)   
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)   
- it does not happen after I increased from 4096M to 8192M (JRockit

R27; more intelligent stacktrace, isn't it?)

Thanks Mark; I didn't know that it happens only once (on warming up a
searcher).



Quoting Mark Miller [EMAIL PROTECTED]:


Because to sort efficiently, Solr loads the term to sort on for each
doc in the index into an array. For ints,longs, etc its just an array
the size of the number of docs in your index (i believe deleted or
not). For a String its an array to hold each unique string and an array
of ints indexing into the String array.

So if you do a sort, and search for something that only gets 1 doc as a
hit...your still loading up that field cache for every single doc in
your index on the first search. With solr, this happens in the
background as it warms up the searcher. The end story is, you need more
RAM to accommodate the sort most likely...have you upped your xmx
setting? I think you can roughly say a 2 million doc index would need
40-50 MB (depending and rough, but to give an idea) per field your
sorting on.

- Mark

sundar shankar wrote:

Thanks Fuad.
   But why does just sorting provide an OOM. I 
executed the query without adding the sort clause it executed 
perfectly. In fact I even tried remove the maxrows=10 and
executed.  it came out fine. Queries with bigger results seems   
to  come out  fine too. But why just sort of that too just 10   
rows??

-Sundar





Re: Out of memory on Solr sorting

2008-07-22 Thread Mark Miller

Hmmm...I think it's 32 bits per integer, with an index entry for each doc, so


   **25 000 000 x 32 bits = 95.3674316 megabytes**

Then you have the String array that contains each unique term from your
index...you can guess at that based on the number of terms in your index
and an average-length guess.


There is some other overhead beyond the sort cache as well, but that's
the bulk of what it will add. I think my memory may be bad with my
original estimate :)
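For reference, here is that back-of-envelope arithmetic as a small runnable sketch (assuming the two-array layout discussed in this thread: a 4-byte ordinal per document plus the unique term strings; the unique-term count and average term size are guesses you have to supply yourself):

// Rough estimate of the per-field sort cache, matching Mark's arithmetic.
// maxDoc, uniqueTerms and avgTermBytes are assumptions you plug in yourself.
public class SortCacheEstimate {
    public static void main(String[] args) {
        long maxDoc       = 25191979L; // docs in the index (deleted or not)
        long uniqueTerms  = 25191979L; // worst case: every value distinct
        long avgTermBytes = 32;        // guessed average term size in bytes

        long ordinalBytes = maxDoc * 4;                 // int[] with one slot per doc
        long termBytes    = uniqueTerms * avgTermBytes; // String[] of unique terms
        // (real String objects carry extra per-object overhead; this ignores it)

        System.out.printf("ordinals: %.1f MB%n", ordinalBytes / (1024.0 * 1024.0));
        System.out.printf("terms   : %.1f MB%n", termBytes / (1024.0 * 1024.0));
        System.out.printf("total   : %.1f MB per sorted field, per searcher%n",
                (ordinalBytes + termBytes) / (1024.0 * 1024.0));
    }
}

Two searchers warming and serving at the same time roughly double that, which is Mark's point about searchers on deck.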


Fuad Efendi wrote:

Thank you very much Mark,

it explains me a lot;

I am guessing: for 1,000,000 documents with a [string] field of 
average size 1024 bytes I need 1Gb for single IndexSearcher instance; 
field-level cache it is used internally by Lucene (can Lucene manage 
size if it?); we can't have 1G of such documents without having 1Tb 
RAM...




Quoting Mark Miller [EMAIL PROTECTED]:


Fuad Efendi wrote:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979



I just noticed, this is an exact number of documents in index: 25191979

(http://www.tokenizer.org/, you can sort - click headers Id,  
[COuntry, Site, Price] in a table; experimental)



If array is allocated ONLY on new searcher warming up I am  
_extremely_ happy... I had constant OOMs during past month (SUN  
Java 5).

It is only on warmup - I believe its lazy loaded, so the first time a
search is done (solr does the search as part of warmup I believe) the
fieldcache is loaded. The underlying IndexReader is the key to the
fieldcache, so until you get a new IndexReader (SolrSearcher in solr
world?) the field cache will be good. Keep in mind that as a searcher
is warming, the other search is still serving, so that will up the ram
requirements...and since I think you can have 1 searchers on
deck...you get the idea.

As far as the number I gave, thats from a memory made months and months
ago, so go with what you see.




Quoting Fuad Efendi [EMAIL PROTECTED]:

I've even seen exceptions (posted here) when sort-type queries 
caused

Lucene to allocate 100Mb arrays, here is what happened to me:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979
   at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360)  
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)  
- it does not happen after I increased from 4096M to 8192M (JRockit

R27; more intelligent stacktrace, isn't it?)

Thanks Mark; I didn't know that it happens only once (on warming up a
searcher).



Quoting Mark Miller [EMAIL PROTECTED]:


Because to sort efficiently, Solr loads the term to sort on for each
doc in the index into an array. For ints,longs, etc its just an array
the size of the number of docs in your index (i believe deleted or
not). For a String its an array to hold each unique string and an 
array

of ints indexing into the String array.

So if you do a sort, and search for something that only gets 1 doc 
as a

hit...your still loading up that field cache for every single doc in
your index on the first search. With solr, this happens in the
background as it warms up the searcher. The end story is, you need 
more

RAM to accommodate the sort most likely...have you upped your xmx
setting? I think you can roughly say a 2 million doc index would need
40-50 MB (depending and rough, but to give an idea) per field your
sorting on.

- Mark

sundar shankar wrote:

Thanks Fuad.
   But why does just sorting provide an OOM. I
executed the query without adding the sort clause it executed
perfectly. In fact I even tried remove the maxrows=10 and   
executed.  it came out fine. Queries with bigger results seems  
to  come out  fine too. But why just sort of that too just 10  
rows??

-Sundar




Date: Tue, 22 Jul 2008 12:24:35 -0700 From: [EMAIL PROTECTED]  
To:   solr-user@lucene.apache.org Subject: RE: Out of memory  
on Solr   sorting 
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403) 
 - this piece of code do not request Array[100M] (as I seen 
with  Lucene), it asks only few bytes / Kb for a field...   
Probably 128 - 512 is not enough; it is also advisable to use 
equal sizes -Xms1024M -Xmx1024M (it minimizes GC frequency, 
and itensures that 1024M is available at startup)  OOM happens 
also with fragmented memory, when application requests big  
contigues fragment and GC is unable to optimize; looks like your 
 application requests a little and memory is not available... 
  Quoting sundar shankar [EMAIL PROTECTED]:
  From: [EMAIL PROTECTED]  To: 
solr-user@lucene.apache.org  Subject: Out of memory on Solr 
sorting  Date: Tue, 22 Jul 2008 19:11:02 +
Hi,  Sorry again fellos. I am not sure whats happening. The 
day with   solr is bad for me I guess. EZMLM didnt let me 
send any mails this   morning. Asked me to confirm 
subscription and when I did, it said I   was already a 
member. Now my

RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar
Hi Mark,
I am still getting an OOM even after increasing the heap to 1024.
The docset I have is

numDocs : 1138976 maxDoc : 1180554

Not sure how much more I would need. Is there any other way out of this? I
noticed another interesting behavior. I have a Solr setup on a personal box
where I try out a lot of different configurations and stuff before I even roll
the changes out to dev. That server has been running with similar indexed
data for a lot longer than the dev box, and it seems to have fetched the results
out properly.
That box is a Windows 2-core machine with just about a gig of memory, and the
whole 1024 megs have been allocated to the heap. The dev box is a Linux machine with over 2
gigs of memory and 1024 allocated to the heap now. :S
 
-Sundar



 Date: Tue, 22 Jul 2008 13:17:40 -0700 From: [EMAIL PROTECTED] To: 
 solr-user@lucene.apache.org Subject: Re: Out of memory on Solr sorting  
 Mark,  Question: how much memory I need for 25,000,000 docs if I do a sort 
 by  string field, 256 bytes. 6.4Gb?   Quoting Mark Miller [EMAIL 
 PROTECTED]:   Because to sort efficiently, Solr loads the term to sort on 
 for each  doc in the index into an array. For ints,longs, etc its just an 
 array  the size of the number of docs in your index (i believe deleted or 
  not). For a String its an array to hold each unique string and an array  
 of ints indexing into the String array.   So if you do a sort, and search 
 for something that only gets 1 doc as a  hit...your still loading up that 
 field cache for every single doc in  your index on the first search. With 
 solr, this happens in the  background as it warms up the searcher. The end 
 story is, you need more  RAM to accommodate the sort most likely...have you 
 upped your xmx  setting? I think you can roughly say a 2 million doc index 
 would need  40-50 MB (depending and rough, but to give an idea) per field 
 your  sorting on.   - Mark
_
Wish to Marry Now? Click Here to Register FREE
http://www.shaadi.com/registration/user/index.php?ptnr=mhottag

RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar
Thanks for your help, Mark. Let me explore a little more and see if someone else
can help me out too. :)

 Date: Tue, 22 Jul 2008 16:53:47 -0400 From: [EMAIL PROTECTED] To: 
 solr-user@lucene.apache.org Subject: Re: Out of memory on Solr sorting  
 Someone else is going to have to take over Sundar - I am new to solr  
 myself. I will say this though - 25 million docs is pushing the limits  of a 
 single machine - especially with only 2 gig of RAM, especially with  any 
 sort fields. You are at the edge I believe.  But perhaps you can get by. 
 Have you checked out all the solr stats on  the admin page? Maybe you are 
 trying to load up to many searchers at a  time. I think there is a setting 
 to limit the number of searchers that  can be on deck...  sundar shankar 
 wrote:  Hi Mark,  I am still getting an OOM even after increasing the 
 heap to 1024. The docset I have is numDocs : 1138976 maxDoc : 1180554 
 Not sure how much more I would need. Is there any other way out of 
 this. I noticed another interesting behavior. I have a Solr setup on a 
 personal Box where I try out a lot of different configuration and stuff 
 before I even roll the changes out to dev. This server has been running with 
 a similar indexed data for a lot longer than the dev box and it seems to have 
 fetched the results out properly.   This box is a windows 2 core processor 
 with just about a gig of memory and the whole 1024 megs have been allocated 
 to heap. The dev is a linux with over 2 Gigs of memory and 1024 allocated to 
 heap now. :S-Sundar   Date: Tue, 22 Jul 2008 13:17:40 
 -0700 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: Re: 
 Out of memory on Solr sorting  Mark,  Question: how much memory I need 
 for 25,000,000 docs if I do a sort by  string field, 256 bytes. 6.4Gb?  
  Quoting Mark Miller [EMAIL PROTECTED]:   Because to sort efficiently, 
 Solr loads the term to sort on for each  doc in the index into an array. 
 For ints,longs, etc its just an array  the size of the number of docs in 
 your index (i believe deleted or  not). For a String its an array to hold 
 each unique string and an array  of ints indexing into the String array. 
   So if you do a sort, and search for something that only gets 1 doc as a 
  hit...your still loading up that field cache for every single doc in  
 your index on the first search. With solr, this happens in the  background 
 as it warms up the searcher. The end story is, you need more  RAM to 
 accommodate the sort most likely...have you upped your xmx  setting? I 
 think you can roughly say a 2 million doc index would need  40-50 MB 
 (depending and rough, but to give an idea) per field your  sorting on.  
  - Mark
 _  Wish to 
 Marry Now? Click Here to Register FREE  
 http://www.shaadi.com/registration/user/index.php?ptnr=mhottag
_
Missed your favourite programme? Stop surfing TV channels and start planning 
your weekend TV viewing with our comprehensive TV Listing
http://entertainment.in.msn.com/TV/TVListing.aspx

Re: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi


Ok, what is confusing me is the implicit assumption that the FieldCache contains
the field values and that Lucene uses an in-memory sort instead of using the
file-system index...


The array size is 100Mb (25M x 4 bytes), and it is just pointers (4-byte
integers) to documents in the index.


org.apache.lucene.search.FieldCacheImpl$10.createValue
...
357: protected Object createValue(IndexReader reader, Object fieldKey)
358:   throws IOException {
359:   String field = ((String) fieldKey).intern();
360:   final int[] retArray = new int[reader.maxDoc()]; // OutOfMemoryError!!!
...
408:   StringIndex value = new StringIndex (retArray, mterms);
409:   return value;
410: }
...

It's very confusing, I don't know such internals...


 <field name="XXX" type="string" indexed="true" stored="true" termVectors="true"/>

 The sorting is done based on string field.



I think Sundar should not use [termVectors=true]...



Quoting Mark Miller [EMAIL PROTECTED]:


Hmmm...I think its 32bits an integer with an index entry for each doc, so


   **25 000 000 x 32 bits = 95.3674316 megabytes**

Then you have the string array that contains each unique term from your
index...you can guess that based on the number of terms in your index
and an avg length guess.

There is some other overhead beyond the sort cache as well, but thats
the bulk of what it will add. I think my memory may be bad with my
original estimate :)

Fuad Efendi wrote:

Thank you very much Mark,

it explains me a lot;

I am guessing: for 1,000,000 documents with a [string] field of   
average size 1024 bytes I need 1Gb for single IndexSearcher   
instance; field-level cache it is used internally by Lucene (can   
Lucene manage size if it?); we can't have 1G of such documents   
without having 1Tb RAM...




Quoting Mark Miller [EMAIL PROTECTED]:


Fuad Efendi wrote:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979



I just noticed, this is an exact number of documents in index: 25191979

(http://www.tokenizer.org/, you can sort - click headers Id,
[COuntry, Site, Price] in a table; experimental)



If array is allocated ONLY on new searcher warming up I am
_extremely_ happy... I had constant OOMs during past month (SUN
Java 5).

It is only on warmup - I believe its lazy loaded, so the first time a
search is done (solr does the search as part of warmup I believe) the
fieldcache is loaded. The underlying IndexReader is the key to the
fieldcache, so until you get a new IndexReader (SolrSearcher in solr
world?) the field cache will be good. Keep in mind that as a searcher
is warming, the other search is still serving, so that will up the ram
requirements...and since I think you can have 1 searchers on
deck...you get the idea.

As far as the number I gave, thats from a memory made months and months
ago, so go with what you see.




Quoting Fuad Efendi [EMAIL PROTECTED]:


I've even seen exceptions (posted here) when sort-type queries caused
Lucene to allocate 100Mb arrays, here is what happened to me:

SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray - Object
size: 100767936, Num elements: 25191979
  at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)  - it does not happen after I increased from 4096M to 8192M   
(JRockit

R27; more intelligent stacktrace, isn't it?)

Thanks Mark; I didn't know that it happens only once (on warming up a
searcher).



Quoting Mark Miller [EMAIL PROTECTED]:


Because to sort efficiently, Solr loads the term to sort on for each
doc in the index into an array. For ints,longs, etc its just an array
the size of the number of docs in your index (i believe deleted or
not). For a String its an array to hold each unique string and an array
of ints indexing into the String array.

So if you do a sort, and search for something that only gets 1 doc as a
hit...your still loading up that field cache for every single doc in
your index on the first search. With solr, this happens in the
background as it warms up the searcher. The end story is, you need more
RAM to accommodate the sort most likely...have you upped your xmx
setting? I think you can roughly say a 2 million doc index would need
40-50 MB (depending and rough, but to give an idea) per field your
sorting on.

- Mark

sundar shankar wrote:

Thanks Fuad.
  But why does just sorting provide an OOM. I  
executed the query without adding the sort clause it executed   
   perfectly. In fact I even tried remove the maxrows=10 and
 executed.  it came out fine. Queries with bigger results  
seems   to  come out  fine too. But why just sort of that too  
just 10   rows??

-Sundar




Date: Tue, 22 Jul 2008 12:24:35 -0700 From: [EMAIL PROTECTED]   
 To:   solr-user@lucene.apache.org Subject: RE: Out of  
memory   on Solr   sorting   
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403

Re: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi
Date: Tue, 22 Jul 2008 12:24:35 -0700 From: [EMAIL PROTECTED]  
   To:   solr-user@lucene.apache.org Subject: RE: Out of   
memory   on Solr   sorting
 org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403)
 - this piece of code do not request Array[100M] (as I seen with Lucene), it asks only few bytes / Kb for a field...
 [...]
 Quoting sundar shankar [EMAIL PROTECTED]:
 [...]
  What should I have ideally done. My config on QA is pretty similar to the dev box
  and probably has more data than on dev. It didnt throw any OOM during the
  integration test. The Autocomplete is a new field we added recently.
  Another point is that the indexing is done with a field of type string
  <field name="XXX" type="string" indexed="true" stored="true" termVectors="true"/>
  and the autocomplete field is a copy field.
  The sorting is done based on string field.
  Please do lemme know what mistake am I doing?

  Regards
  Sundar

  P.S: The stack trace of the exception is

  Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
  at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:86)
  at org.apache.solr.client.solrj.impl.BaseSolrServer.query(BaseSolrServer.java:101)
  at com.apollo.sisaw.solr.service.AbstractSolrSearchService.makeSolrQuery(AbstractSolrSearchService.java:193)
  ... 105 more
  Caused by: org.apache.solr.common.SolrException: Java heap space java.lang.OutOfMemoryError: Java heap space
  at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:403)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
  at org.apache.lucene.search.FieldCacheImpl.getStringIndex

RE: Out of memory on Solr sorting

2008-07-22 Thread sundar shankar
I haven't seen the source code before, But I don't know why the sorting isn't 
done after the fetch is done. Wouldn't that make it more faster. at least in 
case of field level sorting? I could be wrong too and the implementation might 
probably be better. But don't know why all of the fields have had to be loaded.
 
 



 Date: Tue, 22 Jul 2008 14:26:26 -0700 From: [EMAIL PROTECTED] To: 
 solr-user@lucene.apache.org Subject: Re: Out of memory on Solr sorting   
 Ok, after some analysis of FieldCacheImpl:  - it is supposed that (sorted) 
 Enumeration of terms is less than  total number of documents (so that 
 SOLR uses specific field type for sorted searches:  solr.StrField with 
 omitNorms=true)  It creates int[reader.maxDoc()] array, checks (sorted) 
 Enumeration of  terms (untokenized solr.StrField), and populates array 
 with document  Ids.   - it also creates array of String String[] mterms 
 = new String[reader.maxDoc()+1];  Why do we need that? For 1G document with 
 average term/StrField size  of 100 bytes (which could be unique text!!!) it 
 will create kind of  huge 100Gb cache which is not really needed... 
 StringIndex value = new StringIndex (retArray, mterms);  If I understand 
 correctly... StringIndex _must_ be a file in a  filesystem for such a 
 case... We create StringIndex, and retrieve top  10 documents, huge 
 overhead.  Quoting Fuad Efendi [EMAIL PROTECTED]:Ok, 
 what is confusing me is implicit guess that FieldCache contains  field 
 and Lucene uses in-memory sort instead of using file-system  
 index...   Array syze: 100Mb (25M x 4 bytes), and it is just 
 pointers (4-byte  integers) to documents in index.   
 org.apache.lucene.search.FieldCacheImpl$10.createValue  ...  357: 
 protected Object createValue(IndexReader reader, Object fieldKey)  358: 
 throws IOException {  359: String field = ((String) fieldKey).intern();  
 360: final int[] retArray = new int[reader.maxDoc()]; //   
 OutOfMemoryError!!!  ...  408: StringIndex value = new StringIndex 
 (retArray, mterms);  409: return value;  410: }  ...   It's very 
 confusing, I don't know such internals...field name=XXX 
 type=string indexed=true stored=true   termVectors=true/ 
  The sorting is done based on string field.I think Sundar 
 should not use [termVectors=true]... Quoting Mark Miller 
 [EMAIL PROTECTED]:   Hmmm...I think its 32bits an integer with an 
 index entry for each doc, so**25 000 000 x 32 bits = 95.3674316 
 megabytes**   Then you have the string array that contains each unique 
 term from your  index...you can guess that based on the number of terms in 
 your index  and an avg length guess.   There is some other overhead 
 beyond the sort cache as well, but thats  the bulk of what it will add. I 
 think my memory may be bad with my  original estimate :)   Fuad 
 Efendi wrote:  Thank you very much Mark,   it explains me a lot; 
   I am guessing: for 1,000,000 documents with a [string] field of  
  average size 1024 bytes I need 1Gb for single IndexSearcher   
 instance; field-level cache it is used internally by Lucene (can   Lucene 
 manage size if it?); we can't have 1G of such documents   without having 
 1Tb RAM... Quoting Mark Miller [EMAIL PROTECTED]:  
  Fuad Efendi wrote:  SEVERE: java.lang.OutOfMemoryError: 
 allocLargeObjectOrArray - Object  size: 100767936, Num elements: 
 25191979I just noticed, this is an exact number of 
 documents in index: 25191979   (http://www.tokenizer.org/, you 
 can sort - click headers Id,   [COuntry, Site, Price] in a table; 
 experimental)If array is allocated ONLY on new searcher 
 warming up I am   _extremely_ happy... I had constant OOMs during past 
 month (SUN   Java 5).  It is only on warmup - I believe its lazy 
 loaded, so the first time a  search is done (solr does the search as 
 part of warmup I believe) the  fieldcache is loaded. The underlying 
 IndexReader is the key to the  fieldcache, so until you get a new 
 IndexReader (SolrSearcher in solr  world?) the field cache will be good. 
 Keep in mind that as a searcher  is warming, the other search is still 
 serving, so that will up the ram  requirements...and since I think you 
 can have 1 searchers on  deck...you get the idea.   As far as 
 the number I gave, thats from a memory made months and months  ago, so 
 go with what you see. Quoting Fuad Efendi [EMAIL 
 PROTECTED]:   I've even seen exceptions (posted here) when 
 sort-type queries caused  Lucene to allocate 100Mb arrays, here is 
 what happened to me:   SEVERE: java.lang.OutOfMemoryError: 
 allocLargeObjectOrArray - Object  size: 100767936, Num elements: 
 25191979  at  
 org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:360)
at  
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) - 
 it does not happen after I increased from 4096M to 8192M   (JRockit 
  R27; more intelligent stacktrace, isn't it?)   Thanks 
 Mark; I didn't know that it happens only

RE: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi
I am hoping [new StringIndex (retArray, mterms)] is called only once
per sort field and cached somewhere in Lucene;


theoretically you need to multiply the number of documents by the size of the
field (supposing that the field contains unique text); you need not tokenize
this field, and you need not store the TermVector.


for 2,000,000 documents with a simple untokenized text field such as a
book title (256 bytes) you probably need 512,000,000 bytes per
Searcher, and as Mark mentioned you should limit the number of searchers
in SOLR.


So Xmx512M is definitely not enough even for simple cases...
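Plugging Fuad's numbers into the same two-array picture gives a rough sanity check (illustrative arithmetic only; it assumes the worst case where every title is unique):

// Fuad's worst case: 2,000,000 docs, every title unique, ~256 bytes each.
public class TitleSortEstimate {
    public static void main(String[] args) {
        long maxDoc    = 2000000L;
        long ordinals  = maxDoc * 4;    // one 4-byte ordinal per document
        long termBytes = maxDoc * 256;  // unique title strings at ~256 bytes each
        System.out.printf("ordinals ~%d MB, terms ~%d MB%n",
                ordinals >> 20, termBytes >> 20);
        // Close to 500 MB for the sort cache alone; add String object overhead,
        // the index itself, and a second searcher during warmup, and Xmx512M
        // cannot survive, which matches Fuad's conclusion.
    }
}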


Quoting sundar shankar [EMAIL PROTECTED]:

I haven't seen the source code before, But I don't know why the   
sorting isn't done after the fetch is done. Wouldn't that make it   
more faster. at least in case of field level sorting? I could be   
wrong too and the implementation might probably be better. But don't  
 know why all of the fields have had to be loaded.






Date: Tue, 22 Jul 2008 14:26:26 -0700 From: [EMAIL PROTECTED] To:   
solr-user@lucene.apache.org Subject: Re: Out of memory on Solr   
sorting   Ok, after some analysis of FieldCacheImpl:  - it is   
supposed that (sorted) Enumeration of terms is less than  total   
number of documents (so that SOLR uses specific field type for   
sorted searches:  solr.StrField with omitNorms=true)  It   
creates int[reader.maxDoc()] array, checks (sorted) Enumeration of   
 terms (untokenized solr.StrField), and populates array with   
document  Ids.   - it also creates array of String String[]   
mterms = new String[reader.maxDoc()+1];  Why do we need that? For  
 1G document with average term/StrField size  of 100 bytes (which   
could be unique text!!!) it will create kind of  huge 100Gb cache   
which is not really needed... StringIndex value = new StringIndex   
(retArray, mterms);  If I understand correctly... StringIndex   
_must_ be a file in a  filesystem for such a case... We create   
StringIndex, and retrieve top  10 documents, huge overhead. 
   Quoting Fuad Efendi [EMAIL PROTECTED]:Ok, what is   
confusing me is implicit guess that FieldCache contains  field   
and Lucene uses in-memory sort instead of using file-system
index...   Array syze: 100Mb (25M x 4 bytes), and it is   
just pointers (4-byte  integers) to documents in index. 
org.apache.lucene.search.FieldCacheImpl$10.createValue  ...
357: protected Object createValue(IndexReader reader, Object   
fieldKey)  358: throws IOException {  359: String field =   
((String) fieldKey).intern();  360: final int[] retArray = new   
int[reader.maxDoc()]; //   OutOfMemoryError!!!  ...  408:   
StringIndex value = new StringIndex (retArray, mterms);  409:   
return value;  410: }  ...   It's very confusing, I don't   
know such internals...field name=XXX type=string  
 indexed=true stored=true   termVectors=true/
The sorting is done based on string field.I think Sundar   
should not use [termVectors=true]... Quoting Mark   
Miller [EMAIL PROTECTED]:   Hmmm...I think its 32bits an  
 integer with an index entry for each doc, so**25 000   
000 x 32 bits = 95.3674316 megabytes**   Then you have the   
string array that contains each unique term from your
index...you can guess that based on the number of terms in your   
index  and an avg length guess.   There is some other   
overhead beyond the sort cache as well, but thats  the bulk of   
what it will add. I think my memory may be bad with my  original  
 estimate :)   Fuad Efendi wrote:  Thank you very much   
Mark,   it explains me a lot;   I am guessing: for   
1,000,000 documents with a [string] field of   average size   
1024 bytes I need 1Gb for single IndexSearcher   instance;   
field-level cache it is used internally by Lucene (can   Lucene  
 manage size if it?); we can't have 1G of such documents 
without having 1Tb RAM... Quoting Mark Miller   
[EMAIL PROTECTED]:   Fuad Efendi wrote:
SEVERE: java.lang.OutOfMemoryError: allocLargeObjectOrArray -   
Object  size: 100767936, Num elements: 25191979
  I just noticed, this is an exact number of documents   
in index: 25191979   (http://www.tokenizer.org/, you   
can sort - click headers Id,   [COuntry, Site, Price] in a   
table; experimental)If array is allocated   
ONLY on new searcher warming up I am   _extremely_ happy... I  
 had constant OOMs during past month (SUN   Java 5).  It  
 is only on warmup - I believe its lazy loaded, so the first time  
a   search is done (solr does the search as part of warmup I   
believe) the  fieldcache is loaded. The underlying IndexReader  
 is the key to the  fieldcache, so until you get a new   
IndexReader (SolrSearcher in solr  world?) the field cache   
will be good. Keep in mind that as a searcher  is warming, the  
 other search is still serving, so that will up the ram
requirements...and since I think you can have 1 searchers on   
 deck...you get the idea.   As far as the number I gave

RE: Out of memory on Solr sorting

2008-07-22 Thread Fuad Efendi
Yes, it is a cache; it stores an array of document IDs ordered by the sort
field, together with the sorted field values; query results can intersect
with it and be reordered accordingly.


But the memory requirements should be well documented.

Internally it uses a WeakHashMap, which is not good(!!!) - a lot of
background warming up of caches which SOLR is not aware of...
Could be.


I think Lucene-SOLR developers should join this discussion:


/**
 * Expert: The default cache implementation, storing all values in memory.
 * A WeakHashMap is used for storage.
 *
..

  // inherit javadocs
  public StringIndex getStringIndex(IndexReader reader, String field)
  throws IOException {
return (StringIndex) stringsIndexCache.get(reader, field);
  }

  Cache stringsIndexCache = new Cache() {

protected Object createValue(IndexReader reader, Object fieldKey)
throws IOException {
  String field = ((String) fieldKey).intern();
  final int[] retArray = new int[reader.maxDoc()];
  String[] mterms = new String[reader.maxDoc()+1];
  TermDocs termDocs = reader.termDocs();
      TermEnum termEnum = reader.terms (new Term (field, ""));
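A small, self-contained demo of the behaviour Fuad is pointing at; a plain Object stands in for the IndexReader key, and the array is scaled down so it runs in a default heap. Entries in a WeakHashMap live only as long as something else still references the key, so the application never really controls how big the cache gets or when it goes away. This is illustrative, not Solr or Lucene code.

import java.util.Map;
import java.util.WeakHashMap;

// Illustrative only: shows why a WeakHashMap-backed cache keyed on the
// IndexReader is hard to size from the outside. Not Solr/Lucene code.
public class WeakCacheDemo {
    static final Map<Object, int[]> cache = new WeakHashMap<Object, int[]>();

    public static void main(String[] args) throws InterruptedException {
        Object reader = new Object();        // stands in for an IndexReader
        cache.put(reader, new int[1 << 20]); // scaled-down "ordinal array" (~4 MB)
        System.out.println("entries while the reader is referenced: " + cache.size());

        reader = null;   // the searcher/reader is closed and dereferenced
        System.gc();     // only a hint; collection timing is not guaranteed
        Thread.sleep(100);
        System.out.println("entries after the reader is unreachable: " + cache.size());
        // Typically prints 0 here, but the application never controlled
        // how much memory the entry used while it was alive.
    }
}

Since the real cache is keyed on the IndexReader, each new searcher builds its own entry, and the old one only goes away once nothing references the old reader any more.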






Quoting Fuad Efendi [EMAIL PROTECTED]:


I am hoping [new StringIndex (retArray, mterms)] is called only once
per-sort-field and cached somewhere at Lucene;

theoretically you need multiply number of documents on size of field
(supposing that field contains unique text); you need not tokenize this
field; you need not store TermVector.

for 2 000 000 documents with simple untokenized text field such as
title of book (256 bytes) you need probably 512 000 000 bytes per
Searcher, and as Mark mentioned you should limit number of searchers in
SOLR.

So that Xmx512M is definitely not enough even for simple cases...


Quoting sundar shankar [EMAIL PROTECTED]:

I haven't seen the source code before, But I don't know why the
sorting isn't done after the fetch is done. Wouldn't that make it
more faster. at least in case of field level sorting? I could be
wrong too and the implementation might probably be better. But   
don't  know why all of the fields have had to be loaded.






Date: Tue, 22 Jul 2008 14:26:26 -0700 From: [EMAIL PROTECTED] To:
solr-user@lucene.apache.org Subject: Re: Out of memory on Solr
sorting   Ok, after some analysis of FieldCacheImpl:  - it is  
  supposed that (sorted) Enumeration of terms is less than
total  number of documents (so that SOLR uses specific field type  
 for  sorted searches:  solr.StrField with omitNorms=true)   
It   creates int[reader.maxDoc()] array, checks (sorted)  
Enumeration  of   terms (untokenized solr.StrField), and  
populates array  with  document  Ids.   - it also creates  
array of String  String[]  mterms = new  
String[reader.maxDoc()+1];  Why do we  need that? For  1G  
document with average term/StrField size  of  100 bytes (which   
could be unique text!!!) it will create kind of   huge 100Gb  
cache  which is not really needed... StringIndex  value = new  
StringIndex  (retArray, mterms);  If I understand  correctly...  
StringIndex  _must_ be a file in a  filesystem for  such a  
case... We create  StringIndex, and retrieve top  10  documents,  
huge overhead.   

 Quoting Fuad Efendi [EMAIL PROTECTED]:Ok, what is
confusing me is implicit guess that FieldCache contains  field  
  and Lucene uses in-memory sort instead of using file-system 
index...   Array syze: 100Mb (25M x 4 bytes), and it is   
 just pointers (4-byte  integers) to documents in index.  
org.apache.lucene.search.FieldCacheImpl$10.createValue  ... 
357: protected Object createValue(IndexReader reader, Object
fieldKey)  358: throws IOException {  359: String field =
((String) fieldKey).intern();  360: final int[] retArray = new
int[reader.maxDoc()]; //   OutOfMemoryError!!!  ...  408:
StringIndex value = new StringIndex (retArray, mterms);  409:
return value;  410: }  ...   It's very confusing, I don't   
 know such internals...field name=XXX   
type=string  indexed=true stored=true 
termVectors=true/   The sorting is done based on string   
field.I think Sundar  should not use   
[termVectors=true]... Quoting Mark  Miller   
[EMAIL PROTECTED]:   Hmmm...I think its 32bits an
integer with an index entry for each doc, so**25 000   
 000 x 32 bits = 95.3674316 megabytes**   Then you have the   
 string array that contains each unique term from your 
index...you can guess that based on the number of terms in your
index  and an avg length guess.   There is some other
overhead beyond the sort cache as well, but thats  the bulk of   
 what it will add. I think my memory may be bad with my
original  estimate :)   Fuad Efendi wrote:  Thank you   
very much  Mark,   it explains me a lot;   I am   
guessing: for  1,000,000 documents with a [string] field of 
average size  1024 bytes I need 1Gb for single IndexSearcher