Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West

Thanks Lance and Michael,


We are running Solr 1.3.0.2009.09.03.11.14.39 (complete version info from the
Solr admin panel is appended below).

I tried running CheckIndex (with the -ea: switch) on one of the shards.
CheckIndex also produced an ArrayIndexOutOfBoundsException, on the larger
segment containing 500K+ documents (complete CheckIndex output appended below).

Is it likely that all 10 shards are corrupted? Or is it possible that we have
simply exceeded some Lucene limit?

I'm wondering if we could have exceeded the Lucene limit of 2.1 billion unique
terms mentioned towards the end of the Lucene Index File Formats document. If
the small 731-document index has nine million unique terms as reported by
CheckIndex, then even though many terms are repeated, it is conceivable that
the 500,000-document index could have more than 2.1 billion unique terms.
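
As a rough back-of-envelope check (my own estimate, using the CheckIndex
figures appended below rather than any measured term counts):

    554,799 docs in the large segment
    2,100,000,000 / 554,799 ~= 3,800 previously-unseen terms per document

With dirty OCR in many languages, plus CommonGrams bigrams, averaging a few
thousand new unique terms per document does not seem out of the question.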

Do you know if the number of terms reported by CheckIndex is the number of
unique terms?

On the other hand, we previously optimized a 1-million-document index down to
one segment and had no problems. That was with an earlier version of Solr, and
it did not include CommonGrams, which could conceivably increase the number of
terms in the index by two or three times.
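
(To illustrate the effect, a sketch of how I understand CommonGrams behaves at
index time, with "of" and "the" treated as common words: the input

    city of the dead

is indexed as something like

    city, city_of, of, of_the, the, the_dead, dead

i.e. both the original tokens and the bigrams around the common words, so term
counts grow quickly for text that is dense in common words.)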


Tom
---

Solr Specification Version: 1.3.0.2009.09.03.11.14.39
Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39
Lucene Specification Version: 2.9-dev
Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55


[tburt...@slurm-4 ~]$  java -Xmx4096m  -Xms4096m -cp
/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
-ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
/l/solrs/1/.snapshot/serve-2010-02-07/data/index 

Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index

Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene
2.9]
  1 of 2: name=_29dn docCount=554799
compound=false
hasProx=true
numFiles=9
size (MB)=267,131.261
diagnostics = {optimize=true, mergeFactor=2,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_29dn_7.del]
test: open reader.OK [184 deleted docs]
test: fields, norms...OK [6 fields]
test: terms, freq, prox...FAILED
WARNING: fixIndex() would remove reference to this segment; full
exception:
java.lang.ArrayIndexOutOfBoundsException: -16777214
at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
at
org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)

  2 of 2: name=_29im docCount=731
compound=false
hasProx=true
numFiles=8
size (MB)=421.261
diagnostics = {optimize=true, mergeFactor=3,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields, norms...OK [6 fields]
test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs;
144869629 tokens]
test: stored fields...OK [3550 total field count; avg 4.856 fields
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]

WARNING: 1 broken segments (containing 554615 documents) detected
WARNING: would write new segments file, and 554615 documents would be lost,
if -fix were specified


[tburt...@slurm-4 ~]$ 


The index is corrupted. In some places ArrayIndexOutOfBoundsException and
NullPointerException are not wrapped as CorruptIndexException.

Try running your code with the Lucene assertions on. Add this to the JVM
arguments: -ea:org.apache.lucene...
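
(For a Solr instance running under Tomcat, one way to do this -- a sketch, your
startup scripts may differ -- is to add the switch to the container's JVM
options before restarting, e.g.:

    export CATALINA_OPTS="$CATALINA_OPTS -ea:org.apache.lucene..."

For the standalone CheckIndex run above, the switch is simply passed on the
java command line.)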





Re: Thanks Robert!

2010-02-05 Thread Tom Burton-West

+1
And thanks to you both for all your work on CommonGrams!

Tom Burton-West


Jason Rutherglen-2 wrote:
> 
> Robert, thanks for redoing all the Solr analyzers to the new API!  It
> helps to have many examples to work from, best practices so to speak.
> 
> 




Re: Contributors - Solr in Action Case Studies

2010-01-20 Thread Tom Burton-West

Hi Otis,

We are using Solr to provide indexing for the full text of 5 million books
(about 4-6 terabytes of text). Our index is currently around 3 terabytes,
distributed over 10 shards with about 310 GB of index per shard. We are using
very large Solr documents (about 750 KB of text, or roughly 100,000 words, per
document), and using CommonGrams to deal with stopwords/common words in
multiple languages.
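
(For anyone curious, a minimal sketch of what a CommonGrams field type can look
like in schema.xml -- the field and words-file names here are illustrative, not
our production configuration:

    <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index single terms plus bigrams built around the common words -->
        <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt"
                ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- at query time the bigrams stand in for the common words they cover -->
        <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt"
                ignoreCase="true"/>
      </analyzer>
    </fieldType>

The common-grams filters ship with Solr 1.4, which is what our 1.4-dev build is
based on.)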

I would be interested in contributing a chapter if this sounds interesting. 
More details about the project are available at
http://www.hathitrust.org/large_scale_search and on our blog at
http://www.hathitrust.org/blogs/large-scale-search (I'll be updating the blog
with details of current hardware and performance tests in the next week or so).

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library



Re: Slow Phrase Queries

2009-10-20 Thread Tom Burton-West

You might try a couple of tests in the Solr admin interface to make sure the
query is being processed the same way in both Solr and raw Lucene:
1) Use the analysis panel to determine whether the Solr filter chain is doing
something unexpected compared to your Lucene filter chain.
2) Run a debug query from the admin interface in Solr, and then the equivalent
query in Lucene, to see if the query is being parsed or otherwise interpreted
differently (see the example query URL below).
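
For the debug query, something along these lines (adjust host, port, and any
core name for your setup) will show the parsed query and component timings in
the debug section of the response:

    http://localhost:8983/solr/select?q=%22City+of+New+York%2C+Matter+of%22&debugQuery=true&rows=0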

Tom 


DHast wrote:
> 
> Hello,
> I have recently installed Solr as an alternative to our home-made Lucene
> search servers, and while in most respects the performance is better, I
> notice that phrase searches are incredibly slow compared to normal Lucene,
> primarily when using facets.
> 
> Example:
> "City of New York, Matter of" takes 11 seconds
> City of New York, Matter of takes 1 second
> 
> The same searches using raw Lucene take 5 seconds and 3 seconds
> respectively.
> 
> I tried cutting out as much as I could from solrconfig without breaking
> it. Is there anything else I could try doing to make Solr perform
> similarly to raw Lucene as far as phrase queries are concerned?
> Thanks
> 




Re: Limit of Index size per machine..

2009-08-06 Thread Tom Burton-West

Hello,

I think you are confusing the size of the data you want to index with the
size of the index.  For our indexes (large full text documents) the Solr
index is about 1/3 of the size of the documents being indexed.  For 3 TB of
data you might have an index of 1 TB or less.  This depends on many factors
in your index configuration, including whether you store fields.

What kind of performance do you need for indexing time and for search
response time?

We are trying to optimize search response time and have been running tests on
a 225 GB Solr index with 32 GB of RAM; 95% of our test queries return in less
than a second. However, the slowest 1% of queries take between 5 and 10
seconds.

On the other hand, it takes almost a week to index about 670 GB of full-text
documents.

We will be scaling up to 3 million documents, which will be about 2 TB of text
and a 0.75 TB index. We plan to distribute the index across 5 machines.
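
(As a rough sanity check on the 1/3 figure: assuming the 225 GB test index was
built from roughly that 670 GB of text, 225/670 is about 0.34, and the planned
0.75 TB index for 2 TB of text works out to 0.375.)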

More information on our setup and results is available at
http://www.hathitrust.org/blogs/large-scale-search

Tom
> > The expected processed log file size per day: 100 GB
> > We are expecting to retain these indexes for 30 days (100*30 ~ 3 TB).

>>> That means we need approximately 3000 GB (Index Size) / 24 GB (RAM) = 125
>>> servers.

It would be very hard to convince my org to go for 125 servers for log
management of 3 terabytes of indexes.

Has anyone used Solr for processing and handling indexes on the order of 3 TB?
If so, how many servers were used for indexing alone?

Thanks,
sS




Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2009-03-06 Thread Tom Burton-West

Hi Norberto,

After working a bit on trying to port the Nutch CommonGrams code, I ran into
lots of dependencies on Nutch and Hadoop. Would it be possible to get more
information on how you use shingles (or code)? Are you creating shingles for
all two-word combinations, or using a list of words?
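
By "shingles for all two-word combinations" I mean something along the lines of
the illustrative schema.xml fragment below (not anyone's actual configuration),
where the shingle filter emits every adjacent word pair, as opposed to
CommonGrams, which only builds pairs around a supplied word list:

    <fieldType name="text_shingle" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- emit every adjacent two-word pair alongside the single terms -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
                outputUnigrams="true"/>
      </analyzer>
    </fieldType>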

Tom


I haven't used Nutch's implementation, but I used the current implementation
(1.3) of ngrams and shingles to address exactly the same issue (a database of
music albums and tracks).
We didn't notice any severe performance hit, but:
- the data set isn't huge (ca. 1 MM docs).
- we reindex nightly via DIH from MS-SQL, so we can use a separate cache layer
to lower the number of hits to Solr.

B
_
{Beto|Norberto|Numard} Meijome




