Re: Facets with an IDF concept

2009-06-23 Thread Kent Fitch
Hi Asif,

I was holding back because we have a similar problem, but we're not
sure how best to approach it, or even whether approaching it at all is
the right thing to do.

Background:
- large index (~35m documents)
- about 120k of these include full text book contents plus metadata;
the rest are just metadata
- we plan to increase the number of full text books to around 1m, and
the number of records will greatly increase

We've found that, because of the sheer volume of full text content,
we get lots of full text results of very low relevance.  The Lucene
relevance ranking works wonderfully to "hide" these way down the list,
and when these are the only results at all, the user may be delighted
to find obscure hits.

But when you search for, say, "soldier of fortune", one of the 55k+
results is Huck Finn, with 4 "soldier(s)" and 6 "fortune(s)", but it
probably isn't relevant.  The searcher will find it in the result
sets, but should the author, subject, dates, formats etc (our facets)
of Huck Finn contribute to the facets shown to the user as strongly
as, say, the top 500 results?  Maybe, but perhaps they are "diluting"
the value of facets contributed by the more relevant results.

So, we are considering restricting the contents of the result bit set
used for faceting to exclude results with a very, very low score,
using our own QueryComponent (see the sketch after these questions).
But there are problems:

- what's a low score?  How will a low score threshold vary across
queries? (Or should we use a rank cutoff instead, which is much more
expensive to compute, or some combination that copes with queries
whose only results have very low relevance?)

- should we do this for all facets, or just some (where the less
relevant results seem particularly annoying, as they can "mask" facets
from the most relevant results - the authors, years and subjects we
have full text for are not representative of the whole corpus)

- if a searcher pages through to the 1000th result page, down to these
less relevant results, should we somehow include these results in the
facets we show?
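
For what it's worth, the core of the QueryComponent we're considering
might be a collector like this rough sketch (illustrative only - it
uses a fixed threshold, though a threshold relative to the top score
is probably more sensible; Collector and OpenBitSet are the standard
Lucene 2.9 classes):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;
  import org.apache.lucene.util.OpenBitSet;

  // Collects only documents whose score clears a threshold, so the
  // resulting bit set can be handed to the faceting code instead of
  // the full result DocSet.
  public class ScoreThresholdCollector extends Collector {
    private final OpenBitSet bits;
    private final float minScore;
    private Scorer scorer;
    private int docBase;

    public ScoreThresholdCollector(int maxDoc, float minScore) {
      this.bits = new OpenBitSet(maxDoc);
      this.minScore = minScore;
    }

    public void setScorer(Scorer scorer) { this.scorer = scorer; }

    public void collect(int doc) throws IOException {
      if (scorer.score() >= minScore) {
        bits.set(docBase + doc);   // keep only "relevant enough" docs
      }
    }

    public void setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
    }

    public boolean acceptsDocsOutOfOrder() { return true; }

    public OpenBitSet getBits() { return bits; }
  }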

sorry, only more questions!

Regards,

Kent Fitch

On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman wrote:
> Hi again,
>
> I guess nobody has used facets in the way I described below before.  Do any
> of the experts have any ideas as to how to do this efficiently and
> correctly?  Any thoughts would be greatly appreciated.
>
> Thanks,
>
> Asif
>
> On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman  wrote:
>
>> Hi all,
>>
>> We have an index of news articles that are tagged with news topics.
>> Currently, we use solr facets to see which topics are popular for a given
>> query or time period.  I'd like to apply the concept of IDF to the facet
>> counts so as to penalize the topics that occur broadly through our index.
>> I've begun to write custom facet component that applies the IDF to the facet
>> counts, but I also wanted to check if anyone has experience using facets in
>> this way.
>>
>> Thanks,
>>
>> Asif
>>
>
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>


best way to cache "base" queries (before application of filters)

2009-05-20 Thread Kent Fitch
Hi,  I'm looking for some advice on how to add "base query" caching to SOLR.

Our use-case for SOLR is:

- a large Lucene index (32M docs, doubling in 6 months; 110GB, increasing 8x
in 6 months)
- a frontend which presents views of this data in 5 "categories" by firing
off 5 queries with the same search term but 5 different "fq" values

For example, an originating query for "sydney harbour" generates 5 SOLR
queries:

- ../search?q=sydney+harbour&fq=category:books
- ../search?q=sydney+harbour&fq=category:maps
- ../search?q=sydney+harbour&fq=category:music
etc

The complicated query expansion (requiring sloppy phrase matches) and
the large database with lots of very large documents mean that some
queries take quite some time (tens to several hundreds of ms), so we'd
like to cache the results of the base query for a short time (long
enough for all related queries to be issued).

It looks like this isn't the use-case for queryResultCache, because its key
is calculated in SolrIndexSearcher like this:

key = new QueryResultKey(cmd.getQuery(), cmd.getFilterList(), cmd.getSort(),
cmd.getFlags());

That is, the filters are part of the key, and the cached result
reflects the application of the filters.  This works great for what it
is probably designed for - supporting paging through results.

So, I think our options are:

- create a new QueryComponent that invokes SolrIndexSearcher differently,
and which has its own (short lived but long entry length) cache of the base
query results - see the sketch after this list

- subclass or change SolrIndexSearcher, perhaps making it "pluggable",
perhaps defining an optional new cache of base query results

- create a subclass of the Lucene IndexSearcher which manages a cache of
query results "hidden" from SolrIndexSearcher (and organise somehow for
SolrIndexSearcher to use that subclass)
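
For concreteness, the short-lived cache in the first option might be
as simple as this sketch (a minimal LRU map keyed on the query alone,
with no filters in the key; thread safety is crude, expiry is elided,
and the value type is whatever representation of the scored base
result the component produces):

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class BaseQueryCache<V> {
    private final Map<String, V> lru;

    public BaseQueryCache(final int maxEntries) {
      // access-ordered LinkedHashMap gives simple LRU eviction
      this.lru = new LinkedHashMap<String, V>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
          return size() > maxEntries;
        }
      };
    }

    public synchronized V get(String queryKey) {
      return lru.get(queryKey);
    }

    public synchronized void put(String queryKey, V result) {
      lru.put(queryKey, result);
    }
  }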

Or perhaps I'm taking the wrong approach to this problem entirely!  Any
advice is greatly appreciated.

Kent Fitch


Re: best way to cache "base" queries (before application of filters)

2009-05-21 Thread Kent Fitch
Thanks for your reply, Yonik:

On Thu, May 21, 2009 at 2:43 AM, Yonik Seeley
 wrote:
>
> Some thoughts:
>
> #1) This is sort of already implemented in some form... see this
> section of solrconfig.xml and try uncommenting it:
> ...

> > On Wed, May 20, 2009 at 12:43 PM, Yonik Seeley wrote:
> > > <useFilterForSortedQuery>true</useFilterForSortedQuery>

> > Of course the examples you gave used the default sort (by score) so
> > this wouldn't help if you do actually need to sort by score.

Right - we need to sort by relevance

> #2) Your problem might be able to be solved with field collapsing on
> the "category" field in the future (but it's not in Solr yet).

Sorry - I didn't understand this

> #3) Current work I'm doing right now will push Filters down a level
> and check them in tandem with the query instead of after.  This should
> speed things up by at least a factor of 2 in your case.
> https://issues.apache.org/jira/browse/SOLR-1165
>
> I'm trying to get SOLR-1165 finished this week, and I'd love to see
> how it affects your performance.
> In the meantime, try useFilterForSortedQuery and let us know if it
> still works (it's been turned off for a long time) ;-)

OK - so this looks like something to make all queries much faster by
only bothering to score results matching a filter?  If so, that's
really great, but I'm not sure it particularly helps our use-case
(other than making all filtered results faster) because:

- we've got one query we want filtered 5 ways to find the top scoring
results matching the query and each filter

- the filtering basically divides that query result set into 5
non-overlapping sets

- the query part is often complicated and expensive - we want to avoid
running it 5 times because our sloppy phrase requirement and often
millions of hits make finding and scoring expensive

- all documents in the query part will be scored eventually, even with
SOLR-1165, because they'll be part of one of the 5 filters

It is tempting to pass back to a custom query component lots of
results - enough so that the 'n' top scoring documents that satisfy
each filter appear - but we may need to pass up to the query component
millions of hits to find, say, the top 5 ranked results for "maps".

It is tempting to apply the filters one by one in our own query
component on a scored document list retrieved by SolrIndexSearcher -
I'm not sure - maybe I haven't understood SOLR-1165?
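
For what it's worth, that idea might look something like this rough
sketch: walk the base results once in score order and bucket each hit
into the category filter that contains it, stopping a bucket once it
has n entries (DocList, DocIterator and DocSet are the standard Solr
classes; everything else is illustrative):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.search.DocIterator;
  import org.apache.solr.search.DocList;
  import org.apache.solr.search.DocSet;

  public class CategoryPartitioner {
    // baseResults must be sorted by score; filters are the 5
    // non-overlapping category DocSets
    public static List<List<Integer>> topPerCategory(DocList baseResults,
                                                     DocSet[] filters,
                                                     int n) {
      List<List<Integer>> buckets = new ArrayList<List<Integer>>();
      for (int i = 0; i < filters.length; i++) {
        buckets.add(new ArrayList<Integer>());
      }
      DocIterator it = baseResults.iterator();
      while (it.hasNext()) {
        int doc = it.nextDoc();
        for (int i = 0; i < filters.length; i++) {
          if (buckets.get(i).size() < n && filters[i].exists(doc)) {
            buckets.get(i).add(doc);  // categories don't overlap
            break;
          }
        }
      }
      return buckets;
    }
  }

(The catch, as noted above, is that baseResults may need to contain
millions of hits before every bucket fills.)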

Thanks also, Walter, for your suggestions.  Our users require the
index to be continuously updated (well, every 10 minutes or so), and
our queries are extremely diverse/"long tail"-ish, so an HTTP cache
will probably not help us.

Kent Fitch


UnInvertedField performance on faceted fields containing many unique terms

2009-06-15 Thread Kent Fitch
Hi,

This may be of interest to other users of SOLR's UnInvertedField who
have a very large number of unique terms in faceted fields.

Our setup is :

- about 34M lucene documents of bibliographic and full text content
- index currently 115GB, will at least double over next 6 months
- moving to support real-time-ish updates (maybe 5 min delay)

We facet on 8 fields, 6 of which are "normal" with small numbers of
distinct values.  But 2 faceted fields, creator and subject, are huge,
with 18M and 9M terms respectively.  (Whether we should be faceting on
such a huge number of values while at the same time attempting to
provide real time-ish updates is another question!  Whether facets
should be derived from all of the hundreds of thousands of results
regardless of match quality, as typically happens in a large full text
index, is yet another question!)  The app is visible here:
http://sbdsproto.nla.gov.au/

On a server with 2xquad core AMD 2382 processors and 64GB memory, java
1.6.0_13-b03, 64 bit run with "-Xmx15192M -Xms6000M -verbose:gc", with
the index on Intel X25M SSD, on start-up the elapsed time to create
the 8 facets is 306 seconds (best time).  Following an index reopen,
the time to recreate them is 318 seconds (best time).

[We have made an independent experimental change to create the facets
with 3 async threads, that is, in parallel, and also to decouple them
from the underlying index, so our facets lag the index changes by the
time to recreate the facets.  With our setup, the 3 threads reduced
facet creation elapsed time from about 450 secs to around 320 secs,
but this will depend a lot on IO capabilities of the device containing
the index, amount of file system caching, load, etc]

Anyway, we noticed that huge amounts of garbage were being collected
during facet generation for the creator and subject fields, and tracked
it down to this decision in UnInvertedField uninvert():

  if (termNum >= maxTermCounts.length) {
// resize, but conserve memory by not doubling
// resize at end??? we waste a maximum of 16K (average of 8K)
int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
maxTermCounts = newMaxTermCounts;
  }

So, we tried the obvious thing:

- allocate 10K terms initially, rather than 1K
- extend by doubling the current size, rather than adding a fixed 4K
- free unused space at the end (but only if unused space is
"significant") by reallocating the array to the exact required size

And also:

- created a static HashMap lookup keyed on field name which remembers
the previous allocated size for maxTermCounts for that field, and
initially allocates that size + 1000 entries
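
In code terms, the growth part of the change looks roughly like this
(a sketch of the approach, not our exact code; the trim threshold and
the numTermsInField name are illustrative):

  if (termNum >= maxTermCounts.length) {
    // grow geometrically instead of by a fixed 4096, eliminating
    // thousands of reallocations and copies for huge term counts
    int newSize = Math.max(termNum + 1, maxTermCounts.length * 2);
    int[] newMaxTermCounts = new int[newSize];
    System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
    maxTermCounts = newMaxTermCounts;
  }

  // ... and after the uninvert loop, trim significant unused space:
  if (maxTermCounts.length - numTermsInField > 1024) {
    int[] trimmed = new int[numTermsInField];
    System.arraycopy(maxTermCounts, 0, trimmed, 0, numTermsInField);
    maxTermCounts = trimmed;
  }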

The second change is a minor optimisation, but the first change, by
eliminating thousands of array reallocations and copies, greatly
improved load times: down from 306 to 124 seconds on the initial load
and from 318 to 134 seconds on reloads after index updates.  About
60-70 secs is still spent in GC, but it is a significant improvement.

Unless you have very large numbers of facet values, this change won't
have any noticeable benefit.

Regards,

Kent Fitch


Re: UnInvertedField performance on faceted fields containing many unique terms

2009-06-15 Thread Kent Fitch
Hi Yonik,

On Tue, Jun 16, 2009 at 10:52 AM, Yonik Seeley wrote:

> All the constants you see in UnInvertedField were a best guess - I
> wasn't working with any real data.  It's surprising that a big array
> allocation every 4096 terms is so significant - I had figured that the
> work involved in processing that many terms would far outweigh
> realloc+GC.

Well, they were pretty good guesses!  The code is extremely fast for
"reasonable" sized term lists.
I think with our 18M terms, the increasingly long array of ints was
being reallocated, copied and garbage collected 18M/4K = ~4,500 times,
creating ~4,500 x (18M x 4 bytes) / 2 = 162GB of garbage to collect.

> Could you open a JIRA issue with your recommended changes?  It's
> simple enough we should have no problem getting it in for Solr 1.4.

Thanks - just added SOLR-1220.  I haven't mentioned the change to the
initial allocation of 10K (rather than 1024) because I don't think it
is significant.  I also haven't mentioned remembering the sizes to
initially allocate, because the improvement is marginal compared to
the big change, and for all I know, a static hashmap keyed on field
names could cause unwanted side effects from field name clashes when
running SOLR with multiple indices.

> Also, are you using a recent Solr build (within the last month)?
> LUCENE-1596 should improve uninvert time for non-optimized indexes.

We're not - but we'll upgrade to the latest version of 1.4 very soon.

> And don't forget to update http://wiki.apache.org/solr/PublicServers
> when you go live!

We will - thanks for your great work in improving SOLR performance
with 1.4 which makes such outrageous uses of facets even thinkable.

Regards,

Kent Fitch


Re: Solr and jvm Garbage Collection tuning

2010-09-10 Thread Kent Fitch
Hi Tim,

For what it is worth, behind Trove (http://trove.nla.gov.au/) are 3
SOLR-managed indices and 1 Lucene index.  None of ours is as big as
one of your shards, and one of our SOLR-managed indices is tiny, but
your experiences with long GC pauses are familiar to us.

One of the most difficult indices to tune is our bibliographic index
of around 38M mostly-metadata records, which is around 125GB with a
97MB tii file.

We need to commit updates and reopen the index every 90 seconds, and
the facet recalculation (using UnInverted) was taking quite a lot of
time, and seemed to generate lots of objects to be collected on each
reopening.

Although we've been through several rounds of tuning which have seemed
to work, at least temporarily, a few months ago we started getting 12
sec "full gc" times every 90 secs, which was no good!

We noticed and did three things:

1) optimise to 1 segment - we'd got to the stage where 50% of the
documents had been updated (hence deleted), and the maxdocid was 50%
bigger than it needed to be, so data structures whose size was
proportional to maxdocid had grown a lot.  Optimising to 1 segment
greatly reduced full GC frequency and times.

2) for most of our facets, forcing the facets to be filters rather
than uninverted happened to work better (see the note after this
list) - but this depends on many factors, and certainly isn't a
cure-all for all facets - uninverted often works much better than
filters!

3) after lots of benchmarking real updates and queries on a dev
system, we came up with this set of JVM parameters that worked "best"
for our environment (at the moment!):

-Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode

I can't say exactly why, except that with this combination of
parameters and our data, a much bigger newgen led to less movement of
objects to oldgen, and non-full-GC collections on oldgen worked much
better.  Currently we are seeing less than 10 Full GC's a day, and
they almost always take less than 4 seconds.
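
(On point 2, the switch we mean is the per-field facet method
parameter - assuming your Solr version exposes facet.method - which
forces a field to be faceted via the filterCache rather than
UnInverted, e.g.:

  f.subject.facet.method=enum

The default for multi-valued text fields, facet.method=fc, is the one
that uses UnInvertedField.)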

This index is running on an 8 core X5570 machine with 64GB, sharing it
with a large/busy mysql instance and the Trove web server.

One of our other indices is only updated once per day, but is larger:
33.5M docs representing full text of archived web pages, 246GB, tii
file is 36MB.

JVM parms are  -Xmx1M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC.

It also does less than 10 Full GC's per day, taking less than 5 sec each.

Our other large index, newspapers, is a native Lucene index, about
180GB with a comparatively large tii of 280MB (probably for the same
reason your tii is large - the contents of this database are mostly
OCR'ed text).  This index is updated/reopened every 3 minutes (to
incorporate OCR text corrections and tagging) and we use bitmaps to
represent all facet values, which typically take 5 secs to rebuild on
each reopen.

JVM parms: -mx15000M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

Although this JVM usually does fewer than 5 GC's per day, these Full
GC's often take 20-30 seconds, and we need to test increasing the
NewSize on this JVM to see if we can reduce these pauses.

The web archive and newspaper indices are running on an 8 core X5570
machine with 72GB.

We are also running a separate copy/version of this index behind the
site http://newspapers.nla.gov.au/ - the main difference is that the
Trove version uses shingling (inspired by the Hathi Trust results) to
improve searches containing common words.  This other version is
running on a machine with 32GB and 8 X5460 cores and has JVM parms:
  -mx11500M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC


Apart from the old newspapers index, all other SOLR/Lucene indices are
maintained on SSDs (Intel X25-M 160GB), which, whilst not having
anything to do with GCs, work very, very well - we couldn't cope with
our current query volumes on rotating disk without spending a great
deal of money.  The old newspaper index is running on a SAN with 24
fast disks backing it, and we can't support the same query rate on it
as we can with the other newspaper index on SSDs (even before the
shingling change).

Kent Fitch
Trove development team
National Library of Australia


Re: high CPU usage and SelectCannelConnector threads used a lot

2010-12-06 Thread Kent Fitch
Hi John,

sounds like this bug in NIO:

http://jira.codehaus.org/browse/JETTY-937

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933

I think recent versions of Jetty work around this bug; alternatively,
try the non-NIO socket connector.
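
(If you go the connector route, the swap in Jetty 6's jetty.xml might
look roughly like this - a sketch, with the port taken from your
thread names:

  <Call name="addConnector">
    <Arg>
      <New class="org.mortbay.jetty.bio.SocketConnector">
        <Set name="port">9983</Set>
      </New>
    </Arg>
  </Call>

replacing the equivalent org.mortbay.jetty.nio.SelectChannelConnector
block.)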

Kent

On Tue, Dec 7, 2010 at 9:10 AM, John Russell  wrote:
> Hi,
> I'm using solr and have been load testing it for around 4 days.  We use the
> solrj client to communicate with a separate jetty based solr process on the
> same box.
>
> After a few days solr's CPU% is now consistently at or above 100% (multiple
> processors available) and the application using it is mostly not responding
> because it times out talking to solr. I connected visual VM to the solr JVM
> and found that out of the many btpool-# threads there are 4 that are pretty
> much stuck in the running state 100% of the time. Their names are
>
> btpool0-1-Acceptor1 SelectChannelConnector @0.0.0.0:9983
> btpool0-2-Acceptor2 SelectChannelConnector @0.0.0.0:9983
> btpool0-3-Acceptor3 SelectChannelConnector @0.0.0.0:9983
> btpool0-9-Acceptor0 SelectChannelConnector @0.0.0.0:9983
>
>
>
> The stacks are all the same
>
>    "btpool0-2 - Acceptor2 SelectChannelConnector @ 0.0.0.0:9983" - Thread
> t...@27
>    java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>        - locked <106a644> (a sun.nio.ch.Util$1)
>        - locked <18dd381> (a java.util.Collections$UnmodifiableSet)
>        - locked <38d07d> (a sun.nio.ch.EPollSelectorImpl)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>        at
> org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:419)
>        at
> org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:169)
>        at
> org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>        at
> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
>        at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>
>       Locked ownable synchronizers:
>        - None
>
> All of the other idle thread pool threads are just waiting for new tasks.
> The active threads never seem to change; it's always these 4.  The selector
> channel appears to be in the jetty code, receiving requests from our other
> process through the solrj client.
>
> Does anyone know what this might mean or how to address it? Are these
> running all the time because they are blocked on IO so not actually
> consuming CPU? If so, what else might be? Is there a better way to figure
> out what is pinning the CPU?
>
> Some more info that might be useful.
>
> 32 bit machine (I know, I know)
> 2.7GB of RAM for the solr process, ~2.5GB of which is "used"
> According to Visual VM, around 25% of CPU time is spent in GC, with the rest
> in application.
>
> Thanks for the help.
>
> John
>