RE: Error with bin/optimize and multiple solr webapps

2007-03-06 Thread Graham Stead
Apologies in advance if SOLR-187 and SOLR-188 look the same -- they are the
same issue. I have been using adjusted scripts locally but hadn't used Jira
before and wasn't sure of the process. I decided to figure it out after
answering Galo's question this morning...then saw that Jeff had mentioned a
similar issue last night. I apologize again for confusion over the double
entry. 

Thanks,
-Graham

> -Original Message-
> From: Jeff Rodenburg [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, March 06, 2007 4:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Error with bin/optimize and multiple solr webapps
> 
> This issue has been logged as:
> 
> https://issues.apache.org/jira/browse/SOLR-188
> 
> A patch file is included for those who are interested.  I've
> unit-tested it in my environment; please validate it for your
> own environment.
> 
> cheers,
> j
> 
> 
> 
> On 3/5/07, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
> >
> > Thanks Hoss.  I'll add an issue in JIRA and attach the patch.
> >
> >
> >
> > On 3/5/07, Chris Hostetter <[EMAIL PROTECTED] > wrote:
> > >
> > >
> > > : This line assumes a single solr installation under Tomcat, whereas
> > > : the multiple webapp scenario runs from a different location (the
> > > : "/solr" part). I'm sure this applies elsewhere.
> > >
> > > good catch ... it looks like all of our scripts assume "/solr/update"
> > > is the correct path to POST commit/optimize messages to.
> > >
> > > : I would submit a patch for JIRA, but couldn't find these files
> > > : under version control.  Any recommendations?
> > >
> > > They live in src/scripts ... a patch would certainly be appreciated.
> > >
> > > FYI: there is an evolution underway to allow XML based update 
> > > messages to be sent to any path (and the fixed path "/update" is 
> > > being deprecated) so it would be handy if the entire URL path was 
> > > configurable (not just the webapp name).
> > >
> > >
> > > -Hoss
> > >
> > >
> >
> 




RE: Time after snapshot is "visible" on the slave

2007-03-06 Thread Graham Stead
I forgot to mention that the admin page (solr/admin/stats.jsp) is an
excellent way to see when the last searcher was opened. After running
commit, you should see updates to the openedAt and registeredAt timestamps,
e.g.:

openedAt : Tue Mar 06 08:14:19 PST 2007
registeredAt : Tue Mar 06 08:15:55 PST 2007 

If you have added documents, you'll see numDocs and/or maxDoc change as well.

If you don't see these values update, then something isn't right. If you see them
update but cannot find your documents in the index, then your indexing
process may not be working correctly.
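
For a quick command-line check, something like this works too (host, port,
and webapp path are whatever your install uses):

curl -s http://localhost:8983/solr/admin/stats.jsp \
  | grep -E 'openedAt|registeredAt|numDocs|maxDoc'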

Hope this helps,
-Graham

PS: If you are running replication with multiple solr instances, your
problem may be caused by a simple bug in the commit, optimize, and
readercycle scripts. Replace the /solr/ in the curl statement with
${webapp_name}:

From:
rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d ""`

To:
rs=`curl http://${solr_hostname}:${solr_port}/${webapp_name}/update -s -d ""`

I haven't had time to commit these bug fixes yet.
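
For reference, my adjusted scripts take webapp_name from the same
conf/scripts.conf file the other variables come from; an illustrative entry
(names and values are examples only):

solr_hostname=localhost
solr_port=8983
webapp_name=solr1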




RE: Time after snapshot is "visible" on the slave

2007-03-06 Thread Graham Stead
Hi Galo,

The snapinstaller actually performs a commit as its last step, so if that
didn't work, it's not surprising that running commit separately didn't work,
either.

I would suggest running the snapinstaller and/or commit scripts with the -V
option. This will produce verbose debugging information and allow you to see
where they encounter problems.
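
For example (run from the slave's solr home; the log path is just an
example):

bin/snapinstaller -V 2>&1 | tee /tmp/snapinstaller.log
bin/commit -V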

Hope this helps,
-Graham




RE: 'accumulate' copyField for faceting

2007-03-01 Thread Graham Stead
Sorry for interloping, but I have been wondering the same thing as Ryan. On
my current index with ~6.1M docs, I restarted Solr and ran a query that
included faceting on 4 fields:

QTime: 5712
numFound: 25908
filterCache stats:
lookups : 0
hits : 0
hitratio : 0.00
inserts : 1
evictions : 0
size : 1
cumulative_lookups : 0
cumulative_hits : 0
cumulative_hitratio : 0.00
cumulative_inserts : 1
cumulative_evictions : 0 

Then I added faceting on a 5th, multivalued field:

QTime: 65551
numFound: 25908
filterCache stats:
lookups : 1898314
hits : 1
hitratio : 0.00
inserts : 1898314
evictions : 1897802
size : 512
cumulative_lookups : 1898314
cumulative_hits : 1
cumulative_hitratio : 0.00
cumulative_inserts : 1898314
cumulative_evictions : 1897802


I realize there are a lot of different values in the 5th multivalued field.
But this is where I'm fuzzy: are we saying there would be no difference
using a tokenized, single valued field versus a multivalued field? Or are we
saying that multivalued is ok, as long as the number of values is less than
the filterCache size? [Unfortunately I don't have a single valued version of
this field to test with]
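
Following Yonik's sizing advice (quoted below), I suppose the brute-force fix
in my case would be a filterCache large enough to hold one entry per unique
facet value -- something like the following, where the size is illustrative
and the heap cost of ~1.9M cached DocSets would need watching:

<filterCache class="solr.LRUCache" size="2000000" initialSize="512"
autowarmCount="0"/>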

Thanks,
-Graham

> I'll be interested in seeing some numbers.  The number of 
> documents matching the base query and filters will also 
> factor in (small will be HashDocSet, large will be BitDocSet).
> 
> Just make sure to run all of your facets, then check the 
> statistics page to see how big you need to make the 
> filterCache to hold them all (and add a little extra for 
> random filters).  The access pattern for the faceting code is 
> worst case for the LRU cache, so it needs to avoid any evictions.
> 
> -Yonik




RE: Incremental replication...

2007-02-13 Thread Graham Stead
We have used replication for a few weeks now and it generally works well.

I believe you'll find that commit operations cause only new segments to be
transferred, whereas optimize operations cause the entire index to be
transferred. Therefore, the amount of data transferred really depends on how
frequently you index new data and how often you call <commit/> and
<optimize/>.
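
For example, our slave just runs the distribution scripts from cron, roughly
like this (path and interval are illustrative; hostnames and ports come from
conf/scripts.conf):

*/5 * * * * /var/solr/solr/bin/snappuller && /var/solr/solr/bin/snapinstaller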

Hope this helps,
-Graham




RE: Gentoo: problem with xml-apis.jar/Apache Tomcat Native Library

2007-02-12 Thread Graham Stead
I'm afraid I don't have the answer; I can only add that we also had this
problem. We later installed the official Tomcat binary, but still get the
"optimal performance in production environments" notification.

-Graham




RE: Debugging Solr memory usage/heap problems

2007-02-06 Thread Graham Stead

Thanks, Chris. I will test with vanilla Solr to clarify whether the problem
occurs with it, or only in the version where we have made changes.

-Graham

> : To tweak our scoring, a custom hit collector in 
> SolrIndexSearcher creates 1
> : fieldCache and 3 ValueSources from 3 fields:
> : - an integer field with many unique values (order 10^4)
> : - another integer field with many unique values (order 10^4)
> : - an integer field with hundreds of unique values
> 
> so you customized SolrIndexSearcher? ... is it possible you 
> have a memory leak in that code?
> 
> If you have all of your cache sizes set to zero, you should 
> be able to start up the server, hit it with a bunch of 
> queries, then trigger a commit and see your heap usage drop 
> significantly. ... if you do that over and over again and see 
> the heap usage grow and grow, there may be something else 
> going on in those changes of yours.




RE: Debugging Solr memory usage/heap problems

2007-02-06 Thread Graham Stead

> > Our queries do not sort by any field. However, we do make use of 
> > FunctionQueries and a typical query is something like:
> >
> > users_query AND (+linear_function_query 
> +recip_function_query
> > +language:english^0 -flags:spam^0)
> 
> Function queries often build fieldCaches--on how many fields 
> do you use function queries, and how big is the set of unique 
> values for those fields?

2 fields:
- date string with hundreds of unique values
- an integer field with < 250 unique values

To tweak our scoring, a custom hit collector in SolrIndexSearcher creates 1
fieldCache and 3 ValueSources from 3 fields:
- an integer field with many unique values (order 10^4)
- another integer field with many unique values (order 10^4)
- an integer field with hundreds of unique values

I thought a function query used ValueSource, so perhaps usage is similar in
both cases. Would a ValueSource load all values into memory, or just unique
ones?
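
If my understanding of FieldCache is right, the integer caches hold one
32-bit value per document regardless of how many values are unique, so each
of the three integer ValueSources above costs roughly 4 bytes x 4.95M docs,
or ~19MB apiece (~57MB total). The date-string field instead builds a
StringIndex: one int order entry per document plus each unique string stored
once. So: an entry for every document, but the string values themselves only
once per unique value (someone please correct me if that's wrong).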

> Is user_query a string of keywords, or is it an arbitrary 
> query in lucene syntax?

It's whatever the user types into a search box (arbitrary Lucene syntax is
supported).
Some queries are intentionally harsh, like 'george OR bush' or 'the OR at'.
The latter matches virtually every document in the index.

Thanks again,
-Graham




RE: Debugging Solr memory usage/heap problems

2007-02-06 Thread Graham Stead
Mike, Yonik, thanks for the quick reply. 
 
> I think it is in your queries.  Are you sorting on many 
> fields?  What is a typical query?  I'm not a lucene expert, 
> but there are lucene experts on this list.

Our queries do not sort by any field. However, we do make use of
FunctionQueries and a typical query is something like:

users_query AND (+linear_function_query +recip_function_query
+language:english^0 -flags:spam^0)

> 2) If your stored fields are very large, try reducing the 
> size of the doc cache.

Is this what you mean? I'm testing with:

<documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>

> During warming, there are *two* searchers open, so double the 
> number for things like the FieldCache.  If you can accept 
> slow first queries (like maybe in an offline query system) 
> then you can turn off all warming.

Good point. I already tried to eliminate warming problems like this:

<filterCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>

I know these changes make things slow, but I'm trying to eliminate as many
variables as possible.

I agree with Mike that the problem must be searches -- after all, the Solr
master works fine and it doesn't host searches. Is there a rule of thumb to
guesstimate the SolrIndexSearcher memory requirements?

Thanks again,
-Graham




Debugging Solr memory usage/heap problems

2007-02-06 Thread Graham Stead
Hi everyone,
 
My Solr JVM runs out of heap space quite frequently. I'm trying to
understand Solr/Lucene's memory usage so I can address the problem
correctly. Otherwise, I feel I'm taking random shots in the dark.
 
I've tried previous troubleshooting suggestions. Here's what I've done:
 
1) Increased Tomcat's JVM heap space, e.g.:
JAVA_OPTS='-Xmx1244m -Xms1244m -server'; # frequent heap space problems
JAVA_OPTS='-XX:+AggressiveHeap -server'; # runs out of heap space at 2.0g
JAVA_OPTS='-Xmx3072m -Xms3072m -server'; # jvm quickly hits 2.9g on 'top'
 
Solr is the only webapp deployed on this Tomcat instance.
 
2) I use Solr collection/distribution to separate indexing and searching.
The indexer is stable now and memory problems only occur when searching on
the Solr slave.
 
3) In solrconfig.xml, I reduced mergeFactor and maxBufferedDocs by 50%:
<mergeFactor>5</mergeFactor>
<maxBufferedDocs>500</maxBufferedDocs>
 
This helped the indexing server but not the Solr slave.
 
4) In solrconfig.xml, I set filterCache, queryResultCache, and documentCache
to 0.
 
Now for my index details: 
- To facilitate highlighting, I currently store doc contents in the index,
so the index consumes 24GB on disk.
- numDocs : 4,953,736 
  maxDoc : 4,953,736 (just optimized)
- Term files:
   logs # du -ksh ../solr/data/index/*.t??
   5.9M    ../solr/data/index/_1kjb.tii
   429M    ../solr/data/index/_1kjb.tis
- I have 22 fields and yes, they currently have norms.

Other info that may be helpful:
- My Solr is from 2006-11-15. We have a few mods, including one extra
fieldCache that stores ~40 bytes/doc.
- Thread counts from solr/admin/threaddump.jsp:
  Java HotSpot(TM) 64-Bit Server VM 1.5.0_08-b03
  Thread Count: current=37 deamon=34 peak=37
 
My machine has Gentoo Linux and 4gb RAM. 'top' indicates the JVM reaches
2.9g RAM (3472m virtual memory) after 10-20 searches and ~20 mins of use. It
seems just a matter of time before more searches or a snapinstaller 'commit'
will make it run out of heap space again.
 
I have flexibility in the changes we can make. I.e., I can omit norms for
most fields, or I can stop storing the doc contents in the index. But before
embarking on a new strategy, I need some assurance that the strategy will
work (crazy, I know). For example, it doesn't seem that removing norms would
save a great deal (I calculate saving 1 byte per norm per field on 21 fields
is ~99MB).
 
So...how do I deduce what's taking up so much memory? Any suggestions would
be very helpful to me (and hopefully to others, too).
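
One thing I plan to try (assuming our 1.5.0_08 JVM supports it; I believe the
flag arrived in 5.0u7) is capturing a heap dump at the moment of failure so
it can be inspected offline:

JAVA_OPTS='-Xmx3072m -Xms3072m -server -XX:+HeapDumpOnOutOfMemoryError'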
 
many thanks,
-Graham


RE: Function boosts...

2006-12-29 Thread Graham Stead
I believe two concepts are getting slightly mixed here: the
LinearFloatFunction, which is a Solr FunctionQuery, and the original Lucene
scoring methodology. FunctionQueries are not part of vanilla Lucene, so you
will not explicitly see them mentioned in the Lucene similarity documents.

The best way to understand how FunctionQueries are applied is to use the
Solr explanations (&debugQuery=1, I believe).

From my experience, each FunctionQuery you add is treated as another term
in the summation. E.g., if the search query has 2 terms and 1 function query
is added, you will see 3 terms summed to yield the score. The function query
result is multiplied by queryNorm(q), making the effect a bit hard to
predict sometimes.
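
For example, a request like this (hostname, port, and the popularity field
are made up) returns the parsed query plus a per-document explain section in
which the function query shows up as its own summand:

curl 'http://localhost:8983/solr/select?q=ipod+_val_:%22linear(popularity,1,0)%22&debugQuery=1'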

Hope this helps,
-Graham

> -Original Message-
> From: escher2k [mailto:[EMAIL PROTECTED]
> Sent: Thursday, December 28, 2006 3:20 PM
> To: solr-user@lucene.apache.org
> Subject: Function boosts...
>
>
> I had a question about the way boosting works - is it a
> final boost on the score that is returned?
> For instance, in the LinearFloatFunction
> (LinearFloatFunction(ValueSource source, float slope, float
> intercept)), is the ValueSource the "core" score returned
> by Lucene that gets boosted?
>
> From,
> From,
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
> score(q,d) = coord(q,d) * queryNorm(q) * sum over t in q of
> ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
> So, is ValueSource really score(q,d), and hence does LinearFloatFunction
> compute Final Score = score(q,d) * slope + intercept ?
>
> Thanks.
> --
> View this message in context:
> http://www.nabble.com/Function-boosts...-tf2892636.html#a8081654
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>