Solr 4.3: Recovering from Too many values for UnInvertedField faceting on field

2013-09-03 Thread Dennis Schafroth
We are harvesting and indexing bibliographic data, and thus have many distinct 
author names in our index. While testing Solr 4 I believe I had pushed a single 
core to 100 million records (91GB of data) and everything was working fine and 
fast. After adding a little more to the index, the following started to happen:

17328668 [searcherExecutor-4-thread-1] WARN org.apache.solr.core.SolrCore – Approaching too many values for UnInvertedField faceting on field 'author_exact' : bucket size=16726546
17328701 [searcherExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – UnInverted multi-valued field {field=author_exact,memSize=336715415,tindexSize=5001903,time=31595,phase1=31465,nTerms=12048027,bigTerms=0,termInstances=57751332,uses=0}
18103757 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field author_exact
  at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:181)
  at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)

I can see that we reached a limit on bucket size. Is there a way to adjust 
this? The index also seems to have exploded in size (217GB).

Thinking that I had reached the limit of what a single core could handle in 
terms of faceting, I deleted records from the index, but even now at 1/3 (32 
million) it still fails with the above error. I have optimised with 
expungeDeletes=true. The index is somewhat larger (76GB) than I would have 
expected.

While we can still use the index and get facets back using the enum method on 
that field, I would still like a way to fix the index if possible. Any suggestions? 
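
For reference, the enum workaround mentioned above could be made the default for this one field only, so that other facet fields keep the usual method. This is a minimal sketch and not taken from our actual configuration: the handler name and the minDf value are assumptions, while author_exact is the field from the log above.

```xml
<!-- solrconfig.xml fragment: sketch only; handler name and values are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- keep the default (fc) method for ordinary facet fields -->
    <str name="facet.method">fc</str>
    <!-- per-field override: enumerate terms instead of un-inverting author_exact -->
    <str name="f.author_exact.facet.method">enum</str>
    <!-- only use the filterCache for terms that match at least this many documents -->
    <str name="facet.enum.cache.minDf">100</str>
  </lst>
</requestHandler>
```

The same override can be sent per request as f.author_exact.facet.method=enum. It avoids building the UnInvertedField structure for that field, at the price of slower faceting on it.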

cheers, 
:-Dennis

Re: Slow first searcher with facet on bibliographic data in Master - Slave

2012-03-29 Thread Dennis Schafroth

I do have a firstSearcher, but currently coldSearcher is set to true. But 
doesn't this just mean that any searches will block while the first 
searcher is running? This is how the comment describes first searcher. It would 
almost give the same effect: some searches take a long time.   

What I am looking for is, after receiving replicated data, to run the first 
searcher and only then switch to the new index. 

I will try with coldSearcher false, but I actually think I have already tried 
this. 
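
For reference, the setting discussed here lives in the query section of solrconfig.xml. The fragment below is a sketch with assumed values, not the configuration from the attachment. As far as the comments in the stock solrconfig.xml describe it, "false" makes requests block until the first searcher has finished warming, while "true" lets requests hit the still-cold searcher right away.

```xml
<!-- solrconfig.xml fragment: sketch only; the values are assumptions -->
<query>
  <!-- false: requests wait until the first searcher has finished warming;
       true: requests are served immediately by the unwarmed searcher -->
  <useColdSearcher>false</useColdSearcher>

  <!-- upper bound on searchers warming in the background at the same time -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```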

cheers, 
:-Dennis

On Mar 29, 2012, at 13:57 , fbrisbart wrote:

 If you add your query to the firstSearcher and/or newSearcher event
 listeners in the slave
 'solrconfig.xml' ( 
 http://wiki.apache.org/solr/SolrCaching#newSearcher_and_firstSearcher_Event_Listeners
  ),
 
 each new search instance will wait before accepting queries.
 
 Example to load the FieldCache for 'your_facet_field' field :
 ...
 <listener event="firstSearcher" class="solr.QuerySenderListener">
   <arr name="queries">
     <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">your_facet_field</str></lst>
   </arr>
 </listener>
 ...
 
 
 Franck
 
 On Thursday 29 March 2012 at 13:30 +0200, Dennis Schafroth wrote:
 Hi 
  
 I am running indexing and faceted searching on bibliographic data, which is 
 known not to perform too well due to the high facet count. Actually it's just 
 the first search that is horribly slow, 200+ seconds. After that, I am 
 getting okay times (1 second), at least in the few-users scenario we have 
 now. 
 
 The current index is 54 million records with approx. 10 million unique 
 authors. The facets (… _exact) are using the string type. 
 
 I had hoped that a master (indexing) and slave (searching) would have solved 
 the issue, but I am still seeing the issue on the slave, so I guess I must 
 have misunderstood (or perhaps misconfigured) something.
 
 I had thought that the slave would not switch to the new index until the 
 auto warming was completed.  Is such behavior possible? 
 
 I guess an alternative solution could be to have multiple slaves and take a 
 slave off-line when doing replication, but if it is possible to do it more 
 simply (and using 1/3 less space) that would be great. Then again we might need 
 multiple slaves with more requests.
 
 Attached are the configuration files.
 
 Let me know if there is missing information. 
 
 cheers, 
 :-Dennis Schafroth
 
 
 
 



Re: Slow first searcher with facet on bibliographic data in Master - Slave

2012-03-29 Thread Dennis Schafroth

On Mar 29, 2012, at 14:49 , fbrisbart wrote:

 Arf, I didn't see your attached tgz.
 
 In your slave solrconfig.xml, only the 'firstSearcher' contains the
 query. Add it also in the 'newSearcher', so that the new search
 instances will wait also after a new index is replicated.

Did that now, but I believe my case is mostly a first searcher issue. Anyway it 
didn't seem to change anything. 
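
For reference, a sketch of what registering the same warming query under both events could look like in the slave's solrconfig.xml. The field name your_facet_field is the placeholder from Franck's example, not a real field, and the rest is an assumption about the layout rather than a copy of the attached configuration.

```xml
<!-- solrconfig.xml fragment (slave): sketch only; your_facet_field is a placeholder -->
<query>
  <!-- runs once at startup, when no searcher exists yet -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">your_facet_field</str>
      </lst>
    </arr>
  </listener>

  <!-- runs for every later searcher, e.g. after a replicated index is installed -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">your_facet_field</str>
      </lst>
    </arr>
  </listener>
</query>
```

With the newSearcher listener in place, the old searcher should keep serving requests while the new one populates the field cache.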

 
 The first request is long because the default faceting method uses the
 FieldCache for your facet fields.

Jup, i know. 

 You may also choose to use facet.method=enum. The performance is
 globally worse

You say. This means that every search with facets is now 20 seconds instead of 
2. Then I prefer the field cache with one bad first search. 

 than the 'fc' method, but you will avoid the very slow
 first request. Btw, it's far better to use the default 'enum' facet
 method.

Thanks for the input so far. 

 
 Hope this helps,
 Franck
 
 
 
 
 
 



Re: Slow first searcher with facet on bibliographic data in Master - Slave

2012-03-29 Thread Dennis Schafroth
I was wrong! It does seem to work! 

Thanks a bunch! 

cheers,
:-Dennis

On Mar 29, 2012, at 15:52 , fbrisbart wrote:

 I had the same issue months ago.
 'newSearcher' fixed the problem for me.
 I also remember that I had to upgrade solr (3.1) because it didn't work
 with release 1.4 
 But, I suppose you already have a solr 3.x or more.
 So I'm afraid I can't help you more :o(
 
 Franck
 
 
 On Thursday 29 March 2012 at 15:41 +0200, Dennis Schafroth wrote:
 On Mar 29, 2012, at 14:49 , fbrisbart wrote:
 
 [...]
 
 You may also choose to use facet.method=enum. The performance is
 globally worse than the 'fc' method, but you will avoid the very slow
 first request. Btw, it's far better to use the default 'enum' facet
 method.
 I meant the default 'fc' method of course :o)
 
 



Re: Solr memory consumption

2011-06-02 Thread Dennis Schafroth

I ran out of memory on some big indexes when using Solr 1.4. I found out that 
increasing

termInfosIndexDivisor

in solrconfig.xml could help a lot. 

It may slow down searches against your index, though.
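
For reference, a sketch of how this can be set in a Solr 1.4 solrconfig.xml, based on the commented-out block in the example configuration as I remember it. The value 12 is only an illustration, not the value I used.

```xml
<!-- solrconfig.xml fragment: sketch only; the divisor value is an assumption -->
<indexReaderFactory name="IndexReaderFactory"
                    class="org.apache.solr.core.StandardIndexReaderFactory">
  <!-- load only every Nth entry of the term index into memory:
       less heap used, more seeking per term lookup -->
  <int name="termInfosIndexDivisor">12</int>
</indexReaderFactory>
```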

cheers,
:-Dennis


On 02/06/2011, at 01.16, Alexey Serba wrote:

 Hey Denis,
 
 * How big is your index in terms of number of documents and index size?
 * Is it production system where you have many search requests?
 * Is there any pattern for OOM errors? I.e. right after you start your
 Solr app, after some search activity or specific Solr queries, etc?
 * What are 1) cache settings 2) facets and sort-by fields 3) commit
 frequency and warmup queries?
 etc
 
 Generally you might want to connect to your jvm using jconsole tool
 and monitor your heap usage (and other JVM/Solr numbers)
 
 * http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
 * http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX
 
 HTH,
 Alexey
 
 2011/6/1 Denis Kuzmenok forward...@ukr.net:
 There were no parameters at all, and Java hit out-of-memory errors
 almost every day. Then I tried adding parameters, but nothing changed.
 Xms/Xmx did not solve the problem either. Now I am trying MaxPermSize,
 because it's the last thing I haven't tried yet :(
 
 
 Wednesday, June 1, 2011, 9:00:56 PM, you wrote:
 
 Could be related to your crazy high MaxPermSize like Marcus said.
 
 I'm no JVM tuning expert either. Few people are, it's confusing. So if
 you don't understand it either, why are you trying to throw in very
 non-standard parameters you don't understand?  Just start with whatever
 the Solr example jetty has, and only change things if you have a reason
 to (that you understand).
 
 On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:
 Overall memory on the server is 24G, plus 24G of swap. Most of the time
 swap is free and not used at all, which is why "no free swap" sounds
 strange to me.
 
 
 
 
 
 



Re: solrj issue: SocketTimeout: read timed out, but commit succed on server.

2011-05-17 Thread Dennis Schafroth

It also happens on add records.  

Putting a proxy in between client and server revealed that the server writes 
zero bytes back on the update, so what the client says is correct. So I guess I 
have to dig into the server code.

Limiting to fewer updates before commit does seem to make the chance of success 
higher.
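
One server-side option that fits this observation, and which I have not verified against this problem, is to let Solr commit on its own schedule, so the client never has to hold a connection open for one large commit. The thresholds below are assumptions. On the client side, CommonsHttpSolrServer also has a setSoTimeout method for raising the read timeout.

```xml
<!-- solrconfig.xml fragment: sketch only; thresholds are assumptions -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit after this many pending documents ... -->
    <maxDocs>10000</maxDocs>
    <!-- ... or after this many milliseconds, whichever comes first -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```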

Any input will be greatly appreciated. 

cheers, 
:-Dennis

On 17/05/2011, at 14.43, Dennis Schafroth wrote:

 Hi
 
 I can see others are having the same issue, but I haven't seen any fixes or 
 workarounds. 
 
 
 I am adding and deleting records in a mixed stream, in bulks of up to 1000 
 records. On the commit I see the following in the client: 
 
 2011-05-17 13:42:41 ERROR - harvester [main/com.indexdata.masterkey.localindices.harvest.storage.SolrRecordStorage] - Commit failed when adding 39900 and deleting 11666.
 org.apache.solr.client.solrj.SolrServerException: java.net.SocketTimeoutException: Read timed out
   at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
   at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
   at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
   at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
   at com.indexdata.masterkey.localindices.harvest.storage.SolrRecordStorage.commit(SolrRecordStorage.java:47)
   at com.indexdata.masterkey.localindices.harvest.storage.BulkSolrRecordStorage.commit(BulkSolrRecordStorage.java:101)
   at com.indexdata.masterkey.localindices.harvest.job.OAIRecordHarvestJob.run(OAIRecordHarvestJob.java:146)
   at com.indexdata.masterkey.localindices.harvest.job.TestOAIRecordHarvestJob.TestCleanFullBulkHarvestJob(TestOAIRecordHarvestJob.java:65)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at junit.framework.TestCase.runTest(TestCase.java:164)
   at junit.framework.TestCase.runBare(TestCase.java:130)
   at junit.framework.TestResult$1.protect(TestResult.java:106)
   at junit.framework.TestResult.runProtected(TestResult.java:124)
   at junit.framework.TestResult.run(TestResult.java:109)
   at junit.framework.TestCase.run(TestCase.java:120)
   at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
   at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
   at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.net.SocketTimeoutException: Read timed out
   at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
   at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
   at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
   at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
   at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
   at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
   at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
   at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
   at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
   ... 24 more
 
 
 But the server seems pretty happy anyway: 
 
 17-05-2011 13:42:40 org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start 
 commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
 17-05-2011 13:42

Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Dennis Schafroth
Hi, 

I am pretty much a novice at Solr coding, but I am looking for hints about how 
(if not already done) to implement a PatternTokenizer that would index this into 
multi-valued fields of solr.StrField for faceting. Example: 

Water -- Irrigation ; Water -- Sewage

should be tokenized into 

Water
Irrigation
Sewage

into multi-valued, non-tokenized fields, for performance reasons. I could do it 
from the outside, but I would like to use this as an opportunity to learn about 
Solr.

It works as I want with the PatternTokenizerFactory when I am using 
solr.TextField, but not when I am using the non-tokenized solr.StrField. But 
according to what I have read, facet performance is better on non-tokenized 
fields. We need better performance on our faceted searches on these multi-valued 
fields (25 million documents, three multi-valued facets).

I would also need a filter that filters out identical values, as the 
feeds have redundant data as shown above.
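
For reference, this is roughly the TextField variant that does the splitting. It is a sketch: the type name subject_facet and the field name are assumptions. solr.StrField ignores any analyzer, which is why the same tokenizer has no effect there. As far as I understand, facet counts are per document, so a repeated token such as the second "Water" should not inflate the counts, though I have not verified that.

```xml
<!-- schema.xml fragment: sketch only; type and field names are assumptions -->
<fieldType name="subject_facet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on "--" or ";" including any surrounding whitespace -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*(?:--|;)\s*"/>
    <!-- strip whitespace left at the start or end of the whole value -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<field name="subject_exact" type="subject_facet"
       indexed="true" stored="false" multiValued="true"/>
```

Note that only the indexed terms are split this way. If the field is stored, the stored value remains the original unsplit string.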

Can anyone point me in the right direction?

cheers, 
:-Dennis 

Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Dennis Schafroth
Thanks for the hints! 

Sorry about stealing the thread "query range in multivalued date field". 
I mistakenly responded to it. 

cheers,
:-Dennis 

On 27/01/2011, at 16.48, Erik Hatcher wrote:

 Beyond what Erick said, I'll add that it is often better to do this from the 
 outside and send in multiple actual end-user displayable facet values.  When 
 you send in a field like Water -- Irrigation ; Water -- Sewage, that is 
 what will get stored (if you have it set to stored), but what you might 
 rather want is each individual value stored, which can only be done by the 
 indexer sending in multiple values, not through just tokenization.
 
   Erik
 



Garbled facets even in a zero hit search

2010-09-09 Thread Dennis Schafroth
Hi,

Running on a Debian 5.0.5 64-bit box, using solr-1.4.1 with Java version "1.6.0_20".

I am seeing weird facet results along with the "right"-looking ones. Garbled data, stuff that looks like a buffer overflow / index off by ...

And I even get them when I do a zero-hit search, where I wouldn't expect any facets:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">56</int>
  <lst name="params">
   <str name="facet">true</str>
   <str name="shards">satay:8985/solr</str>
   <str name="start">0</str>
   <str name="q">title:xzyzx</str>
   <str name="f.date.facet.limit">10</str>
   <str name="f.subject_exact.facet.limit">10</str>
   <arr name="facet.field">
    <str>author_exact</str>
    <str>date</str>
    <str>subject_exact</str>
   </arr>
   <str name="f.author_exact.facet.limit">10</str>
   <str name="rows">20</str>
  </lst>
 </lst>
 <result name="response" numFound="0" start="0"/>
 <lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
   <lst name="author_exact">
    <int name=" ">0</int>
    <int name=" !;;!">0</int>
    <int name=" (Domingo, Juan); Imprenta Tormentaria (Córdoba)">0</int>
    <int name=" (Supervisor)">0</int>
    <int name=" *">0</int>
    <int name=" * ">0</int>
    <int name=" * (μτφρ.)">0</int>
    <int name=" * * * ">0</int>
    <int name=" * * * (μτφρ.)">0</int>
    <int name=" * * * *">0</int>
   </lst>
   <lst name="date">
    <int name="">0</int>
    <int name="0001">0</int>
    <int name="0002">0</int>
    <int name="0003">0</int>
    <int name="0004">0</int>
    <int name="0005">0</int>
    <int name="0006">0</int>
    <int name="0007">0</int>
    <int name="0008">0</int>
    <int name="0009">0</int>
   </lst>
   <lst name="subject_exact">
    <int name=" ">0</int>
    <int name=" ! ! R P R">0</int>
    <int name=" !!rrqqyyhqhqwwllrqrqdd!!vvddvv">0</int>
    <int name=" !&quot;&quot;$%&quot;( )*+,($&quot;(">0</int>
    <int name=" !()+, -./01 23456">0</int>
    <int name=" !-decidable and decidable deductive procedures for a restricted FTL with Unless">0</int>
    <int name=" !&lt;f87.03...">0</int>
    <int name=" &quot;)338-8570">0</int>
    <int name=" &quot;-Optimization Schemes and L-Bit Precision: Alternative Perspectives in Combinatorial Optimization">0</int>
    <int name=" &quot;A picture is worth 1K words&quot;">0</int>
   </lst>
  </lst>
  <lst name="facet_dates"/>
 </lst>
</response>

Attachment: response_formated.xml (XML document)

I tried to look for a bug report, but haven't been able to find one that matches. I will try to set up a debug session to get closer, but would love to get feedback if this is a known issue.

cheers,
:-Dennis Schafroth
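
One possible explanation, which nobody in this thread has confirmed: facet.mincount defaults to 0, so a zero-hit search still returns the first ten terms of each facet field with a count of 0. The odd strings may then simply be real terms from the indexed data that happen to sort first. If so, a default like the one below would hide them. The handler name is an assumption.

```xml
<!-- solrconfig.xml fragment: sketch only; the handler name is an assumption -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- leave out facet values that match no document in the current result -->
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>
```

The same parameter can be tried per request by adding facet.mincount=1 to the query.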