Re: TermInfosReader.get ArrayIndexOutOfBoundsException
Thanks Lance and Michael,

We are running Solr 1.3.0.2009.09.03.11.14.39 (complete version info from the Solr admin panel appended below). I tried running CheckIndex (with the -ea: switch) on one of the shards. CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger segment containing 500K+ documents (complete CheckIndex output appended below).

Is it likely that all 10 shards are corrupted? Is it possible that we have simply exceeded some Lucene limit? I'm wondering if we could have exceeded the Lucene limit of 2.1 billion unique terms mentioned towards the end of the Lucene Index File Formats document. If the small 731-document segment has nine million unique terms as reported by CheckIndex, then even though many terms are repeated, it is conceivable that the 500,000+ document segment could have more than 2.1 billion terms. Do you know if the number of terms reported by CheckIndex is the number of unique terms?

On the other hand, we previously optimized a 1 million document index down to 1 segment and had no problems. That was with an earlier version of Solr and did not include CommonGrams, which could conceivably increase the number of terms in the index by 2 or 3 times.

Tom

---
Solr Specification Version: 1.3.0.2009.09.03.11.14.39
Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39
Lucene Specification Version: 2.9-dev
Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55

[tburt...@slurm-4 ~]$ java -Xmx4096m -Xms4096m -cp /l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /l/solrs/1/.snapshot/serve-2010-02-07/data/index

Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index

Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 2: name=_29dn docCount=554799
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=267,131.261
    diagnostics = {optimize=true, mergeFactor=2, os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true, lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_29dn_7.del]
    test: open reader.........OK [184 deleted docs]
    test: fields, norms.......OK [6 fields]
    test: terms, freq, prox...FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.ArrayIndexOutOfBoundsException: -16777214
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)

  2 of 2: name=_29im docCount=731
    compound=false
    hasProx=true
    numFiles=8
    size (MB)=421.261
    diagnostics = {optimize=true, mergeFactor=3, os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true, lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [6 fields]
    test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs; 144869629 tokens]
    test: stored fields.......OK [3550 total field count; avg 4.856 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
WARNING: 1 broken segments (containing 554615 documents) detected
WARNING: would write new segments file, and 554615 documents would be lost, if -fix were specified

[tburt...@slurm-4 ~]$

> The index is corrupted. In some places ArrayIndex and NPE are not wrapped
> as CorruptIndexException.
>
> Try running your code with the Lucene assertions on. Add this to the JVM
> arguments:
>
> -ea:org.apache.lucene...
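(For reference, a minimal sketch of driving CheckIndex from code instead of the command line, against the Lucene 2.9 API discussed above. The index path is a placeholder, and the fixIndex() call is left commented out because, as the output above warns, it would drop the broken segment and its 554,615 documents. Run it with -ea:org.apache.lucene... to get the same assertion checking.)

  import java.io.File;
  import org.apache.lucene.index.CheckIndex;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class CheckShard {
    public static void main(String[] args) throws Exception {
      // Placeholder path: point this at one shard's index directory.
      Directory dir = FSDirectory.open(new File("/path/to/shard/data/index"));
      CheckIndex checker = new CheckIndex(dir);
      checker.setInfoStream(System.out);       // print per-segment details like the output above
      CheckIndex.Status status = checker.checkIndex();
      if (!status.clean) {
        System.out.println("Index has broken segments");
        // checker.fixIndex(status);           // WARNING: removes broken segments and their documents
      }
      dir.close();
    }
  }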
Re: Thanks Robert!
+1  And thanks to you both for all your work on CommonGrams!

Tom Burton-West

Jason Rutherglen-2 wrote:
>
> Robert, thanks for redoing all the Solr analyzers to the new API! It
> helps to have many examples to work from, best practices so to speak.
>
Re: Contributors - Solr in Action Case Studies
Hi Otis,

We are using Solr to provide indexing for the full text of 5 million books (about 4-6 terabytes of text). Our index is currently around 3 terabytes, distributed over 10 shards with about 310 GB of index per shard. We are using very large Solr documents (about 750KB of text, or about 100,000 words per doc), and using CommonGrams to deal with stopwords/common words in multiple languages.

I would be interested in contributing a chapter if this sounds interesting. More details about the project are available at:
http://www.hathitrust.org/large_scale_search
and our blog:
http://www.hathitrust.org/blogs/large-scale-search
(I'll be updating the blog with details of current hardware and performance tests in the next week or so.)

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library
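(As a rough illustration of how a layout like the 10-shard index above is queried, Solr's distributed search takes a shards parameter listing every shard; the host names and query below are hypothetical, not our actual configuration:)

  http://solr-1:8983/solr/select?q=some+phrase
      &shards=solr-1:8983/solr,solr-2:8983/solr,...,solr-10:8983/solr

The node receiving the request fans the query out to each shard in the list and merges the results.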
Re: Slow Phrase Queries
You might try a couple of tests in the Solr admin interface to make sure the query is being processed the same way in both Solr and raw Lucene:

1) Use the analysis panel to determine whether the Solr filter chain is doing something unexpected compared to your Lucene filter chain.
2) Try running a debug query from the admin interface in Solr and then in Lucene to see if the query is being parsed or otherwise interpreted differently.

Tom

DHast wrote:
>
> Hello,
> I have recently installed Solr as an alternative to our home made lucene
> search servers, and while in most respects the performance is better, i
> notice that phrase searches are incredibly slow compared to normal lucene,
> primarily when using facets
>
> example:
> "City of New York, Matter of" takes 11 seconds
> City of New York, Matter of takes 1 second
>
> the same searches using raw lucene take 5 seconds and 3 seconds
> respectively.
>
> i tried cutting out as much as i could from solrconfig without breaking
> it, is there anything else i could try doing to make solr perform
> similarly to raw lucene as far as phrase queries are concerned?
> thanks
>
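(A concrete way to do (2) over HTTP, assuming a default local Solr install; host, port, and handler path are placeholders:)

  http://localhost:8983/solr/select?q=%22City+of+New+York%2C+Matter+of%22&debugQuery=true&rows=0
  http://localhost:8983/solr/select?q=City+of+New+York%2C+Matter+of&debugQuery=true&rows=0

Comparing the parsedquery entries in the two debug outputs against the Query.toString() of the queries built by your raw Lucene servers should show whether Solr is producing a different PhraseQuery, or adding filters, for the quoted search.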
Re: Limit of Index size per machine..
Hello,

I think you are confusing the size of the data you want to index with the size of the index. For our indexes (large full-text documents), the Solr index is about 1/3 of the size of the documents being indexed. For 3 TB of data you might have an index of 1 TB or less. This depends on many factors in your index configuration, including whether you store fields.

What kind of performance do you need for indexing time and for search response time? We are trying to optimize search response time and have been running tests on a 225GB Solr index with 32GB of RAM; 95% of our test queries return in less than a second, but the slowest 1% of queries take between 5 and 10 seconds. On the other hand, it takes almost a week to index about 670GB of full-text documents.

We will be scaling up to 3 million documents, which will be about 2 TB of text and 0.75 TB of index. We plan to distribute the index across 5 machines. More information on our setup and results is available at:
http://www.hathitrust.org/blogs/large-scale-search

Tom

> > The expected processed log file size per day: 100 GB
> > We are expecting to retain these indexes for 30 days (100*30 ~ 3 TB).
>
> That means we need approximately 3000 GB (index size) / 24 GB (RAM) = 125
> servers. It would be very hard to convince my org to go for 125 servers for
> log management of 3 terabytes of indexes.
>
> Has anyone used Solr for processing and handling indexes on the order of
> 3 TB? If so, how many servers were used for indexing alone?
>
> Thanks,
> sS
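(To put rough numbers on that, using only the figures above; the 1/3 ratio comes from our full-text indexes and will certainly differ for log data and other schemas:)

  3 TB of raw logs  x  ~1/3                 ≈  1 TB of index, not 3 TB
  1 TB of index     /  ~225 GB per server   ≈  4-5 servers of the class described above (32 GB RAM each),
                                               rather than 125 servers sized to hold the whole index in RAM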
Re: port of Nutch CommonGrams to Solr for help with slow phrase queries
Hi Norberto,

After working a bit on trying to port the Nutch CommonGrams code, I ran into lots of dependencies on Nutch and Hadoop. Would it be possible to get more information on how you use shingles (or code)? Are you creating shingles for all two-word combinations, or using a list of words?

Tom

Norberto Meijome wrote:
> i haven't used Nutch's implementation, but used the current implementation
> (1.3) of ngrams and shingles to address exactly the same issue (database of
> music albums and tracks). We didn't notice any severe performance hit but:
> - data set isn't huge (ca 1 MM docs).
> - reindexed nightly via DIH from MS-SQL, so we can use a separate cache
>   layer to lower the number of hits to SOLR.
> B
> _
> {Beto|Norberto|Numard} Meijome
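(For anyone following the thread later, a sketch of the two approaches being discussed, as a schema.xml field type. The type name and word-list file are hypothetical; ShingleFilterFactory is the stock Solr factory and forms pairs from all adjacent words, while the commented-out variant only forms pairs involving a listed common word, which is what the Nutch port eventually shipped as in Solr, solr.CommonGramsFilterFactory:)

  <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Shingles for ALL adjacent word pairs, plus the original unigrams: -->
      <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
      <!-- Alternative, common-word pairs only (available once CommonGrams is in Solr):
      <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>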