Re: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Simon Willnauer
On Sun, Sep 12, 2010 at 1:51 AM, Michael McCandless wrote:
> On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom  wrote:
>>  Is there an example of how to set up the divisor parameter in 
>> solrconfig.xml somewhere?
>
> Alas I don't know how to configure terms index divisor from Solr...

You can set the termIndexInterval via

<indexDefaults>
  ...
  <termIndexInterval>128</termIndexInterval>
  ...
</indexDefaults>

which has the same effect but requires reindexing. I don't see that
the index divisor is exposed but maybe we should do so!

simon
>>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
>>> parallel arrays instead of separate objects, and, we hold much less in RAM.
>>> Simply upgrading to 4.0 and re-indexing will show this gain...
>>
>> I'm looking forward to a number of the developments in 4.0, but am a bit 
>> wary of using it in production.   I've wanted to work in some tests with 
>> 4.0, but other more pressing issues have so far prevented this.
>
> Understood.
>
>> What about Lucene 2205?  Would that be a way to get some of the benefit 
>> similar to the changes in flex without the rest of the changes in flex and 
>> 4.0?
>
> 2205 was a similar idea (don't create tons of small objects), but it
> was never committed...
>
>>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>>> dict/index -- is there any way I could get a copy of just the tii/tis
>>> files in your index?  Your index is a great test for Lucene!
>>
>> We haven't been able to make much data available due to copyright and other 
>> legal issues.  However, since there is absolutely no way anyone could 
>> reconstruct copyrighted works from the tii/tis index alone, that should be 
>> ok on that front.  On Monday I'll try to get legal/administrative clearance 
>> to provide the data and also ask around and see if I can get the ok to 
>> either find a spare hard drive to ship, or make some kind of sftp 
>> arrangement.  Hopefully we will find a way to be able to do this.
>
> That would be awesome, thanks!
>
>> BTW, most of the terms are probably the result of dirty OCR and the impact
>> is probably increased by our present "punctuation filter".  When we re-index
>> we plan to use a more intelligent filter that will truncate extremely long
>> tokens on punctuation, and we also plan to do some minimal prefiltering prior
>> to sending documents to Solr for indexing.  However, since we now have
>> over 400 languages, we will have to be conservative in our filtering since
>> we would rather index dirty OCR than risk not indexing legitimate content.
>
> Got it... it's a great test case for Lucene :)
>
> Mike
>


Re: Solr and jvm Garbage Collection tuning

2010-09-11 Thread Dennis Gearon
Thanks for the real life examples.

You would have to do a LOT of sharding to get that to work better.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/10/10, Kent Fitch  wrote:

> From: Kent Fitch 
> Subject: Re: Solr and jvm Garbage Collection tuning
> To: solr-user@lucene.apache.org
> Date: Friday, September 10, 2010, 10:45 PM
> Hi Tim,
> 
> For what it is worth, behind Trove (http://trove.nla.gov.au/) are 3
> SOLR-managed indices and 1 Lucene index. None of ours is as big as one
> of your shards, and one of our SOLR-managed indices is tiny, but your
> experiences with long GC pauses are familiar to us.
> 
> One of the most difficult indices to tune is our bibliographic index
> of around 38M mostly-metadata records, which is around 125GB with 97MB
> of tii files.
> 
> We need to commit updates and reopen the index every 90 seconds, and
> the facet recalculation (using UnInverted) was taking quite a lot of
> time, and seemed to generate lots of objects to be collected on each
> reopening.
> 
> Although we've been through several rounds of tuning which have seemed
> to work, at least temporarily, a few months ago we started getting 12
> sec "full gc" times every 90 secs, which was no good!
> 
> We've noticed/did three things:
> 
> 1) optimise to 1 segment - we'd got to the stage where 50% of the
> documents had been updated (hence deleted), and the maxdocid was 50%
> bigger than it needed to be, and hence data structures whose size was
> proportional to maxdocid had increased a lot.  Optimising to 1 segment
> greatly reduced full GC frequency and times.
> 
> 2) for most of our facets, forcing the facets to be filters rather
> than uninverted happened to work better (see the facet.method note at
> the end of this mail) - but this depends on many factors, and certainly
> isn't a cure-all for all facets - uninverted often works much better
> than filters!
> 
> 3) after lots of benchmarking real updates and queries on a dev
> system, we came up with this set of JVM parameters that worked "best"
> for our environment (at the moment!):
> 
> -Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
> -XX:+CMSIncrementalMode
> 
> I can't say exactly why, except that with this combination of
> parameters and our data, a much bigger newgen led to less movement of
> objects to oldgen, and non-full-GC collections on oldgen worked much
> better.  Currently we are seeing less than 10 Full GC's a day, and
> they almost always take less than 4 seconds.
> 
> This index is running on an 8 core X5570 machine with 64GB, sharing it
> with a large/busy mysql instance and the Trove web server.
> 
> One of our other indices is only updated once per day, but is larger:
> 33.5M docs representing full text of archived web pages, 246GB, tii
> file is 36MB.
> 
> JVM parms are -Xmx1M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC.
> 
> It also does less than 10 Full GC's per day, taking less than 5 sec each.
> 
> Our other large index, newspapers, is a native Lucene index, about
> 180GB with a comparatively large tii of 280MB (probably for the same
> reason your tii is large - the contents of this database is mostly
> OCR'ed text).  This index is updated/reopened every 3 minutes (to
> incorporate OCR text corrections and tagging) and we use a bitmap to
> represent all facet values, which typically takes 5 secs to rebuild on
> each reopen.
> 
> JVM parms: -mx15000M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> 
> Although this JVM usually does fewer than 5 GC's per day, these Full
> GC's often take 20-30 seconds, and we need to test increasing the
> NewSize on this JVM to see if we can reduce these pauses.
> 
> The web archive and newspaper indexes are running on an 8 core X5570
> machine with 72GB.
> 
> We are also running a separate copy/version of this index behind the
> site http://newspapers.nla.gov.au/ - the main difference is that the
> Trove version uses shingling (inspired by the Hathi Trust results) to
> improve searches containing common words.  This other version is
> running on a machine with 32GB and 8 X5460 cores and has JVM parms:
> -mx11500M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> 
> Apart from the old newspapers index, all other SOLR/Lucene indices are
> maintained on SSDs (Intel x25m 160GB), which, whilst not having
> anything to do with GCs, work very, very well - we couldn't cope with
> our current query volumes on rotating disk without spending a great
> deal of money.  The old newspaper index is running on a SAN with 24
> fast disks backing it, and we can't support the same query rate on it
> as we can with the other newspaper index on SSDs (even before the
> shingling change).
> 
> Kent Fitch
> Trove development team
> National Library of Australia
>
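
A note on Kent's point 2: as far as I know, the filters-vs-UnInverted choice
he describes maps to Solr 1.4's facet.method parameter -- "enum" builds a
filter per facet value, while "fc" uses the UnInverted field cache. A sketch,
with placeholder field names:

facet=true&facet.field=format&facet.method=enum&f.author.facet.method=fc

Low-cardinality fields often do well with enum; high-cardinality fields
usually favour fc, so it is worth benchmarking per field, as Kent says.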


Re: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Michael McCandless
On Sat, Sep 11, 2010 at 11:07 AM, Burton-West, Tom  wrote:
>  Is there an example of how to set up the divisor parameter in solrconfig.xml 
> somewhere?

Alas I don't know how to configure terms index divisor from Solr...

>>>In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large 
>>>parallel arrays instead of separate objects, and,
>>>we hold much less in RAM.  Simply upgrading to 4.0 and re-indexing will show 
>>>this gain...
>
> I'm looking forward to a number of the developments in 4.0, but am a bit wary 
> of using it in production.   I've wanted to work in some tests with 4.0, but 
> other more pressing issues have so far prevented this.

Understood.

> What about Lucene 2205?  Would that be a way to get some of the benefit 
> similar to the changes in flex without the rest of the changes in flex and 
> 4.0?

2205 was a similar idea (don't create tons of small objects), but it
was never committed...

>>>I'd be really curious to test the RAM reduction in 4.0 on your terms  
>>>dict/index --
>>>is there any way I could get a copy of just the tii/tis  files in your 
>>>index?  Your index is a great test for Lucene!
>
> We haven't been able to make much data available due to copyright and other 
> legal issues.  However, since there is absolutely no way anyone could 
> reconstruct copyrighted works from the tii/tis index alone, that should be ok 
> on that front.  On Monday I'll try to get legal/administrative clearance to 
> provide the data and also ask around and see if I can get the ok to either 
> find a spare hard drive to ship, or make some kind of sftp arrangement.  
> Hopefully we will find a way to be able to do this.

That would be awesome, thanks!

> BTW, most of the terms are probably the result of dirty OCR and the impact is
> probably increased by our present "punctuation filter".  When we re-index
> we plan to use a more intelligent filter that will truncate extremely long
> tokens on punctuation, and we also plan to do some minimal prefiltering prior
> to sending documents to Solr for indexing.  However, since we now have over
> 400 languages, we will have to be conservative in our filtering since we
> would rather index dirty OCR than risk not indexing legitimate content.

Got it... it's a great test case for Lucene :)

Mike


Re: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Lance Norskog
There is a trick: facet values with only one occurrence tend to be misspellings
or dirt. You can write a program to fetch the terms (Lucene's CheckIndex is
a great starting point) and create a stopwords file from them.
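
A minimal sketch of that idea against the Lucene 3.x API; the "ocr" field
name, the file arguments, and the docFreq == 1 cutoff are all assumptions:

import java.io.File;
import java.io.PrintWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class RareTermDumper {
  public static void main(String[] args) throws Exception {
    // args[0] = index directory, args[1] = output stopwords file
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
    PrintWriter out = new PrintWriter(args[1], "UTF-8");
    TermEnum terms = reader.terms();
    while (terms.next()) {           // next() must be called before term()
      Term t = terms.term();
      // terms that occur in exactly one document are likely OCR noise
      if ("ocr".equals(t.field()) && terms.docFreq() == 1) {
        out.println(t.text());
      }
    }
    terms.close();
    out.close();
    reader.close();
  }
}

Whether docFreq == 1 really means dirt depends on the collection -- with 400+
languages, some singletons will be legitimate words.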


Here's a data mining project: which languages are more vulnerable to 
dirty OCR?


Burton-West, Tom wrote:

Thanks Mike,

>> Do you use a terms index divisor?  Setting that to 2 would halve the
>> amount of RAM required but double (on average) the seek time to locate
>> a given term (but, depending on your queries, that seek time may still
>> be a negligible part of overall query time, ie the tradeoff could be
>> very worth it).

On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment
with the index divisor.  Is there an example of how to set up the divisor
parameter in solrconfig.xml somewhere?

>> In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large
>> parallel arrays instead of separate objects, and, we hold much less in RAM.
>> Simply upgrading to 4.0 and re-indexing will show this gain...

I'm looking forward to a number of the developments in 4.0, but am a bit wary
of using it in production.  I've wanted to work in some tests with 4.0, but
other more pressing issues have so far prevented this.

What about Lucene 2205?  Would that be a way to get some of the benefit
similar to the changes in flex without the rest of the changes in flex and
4.0?

>> I'd be really curious to test the RAM reduction in 4.0 on your terms
>> dict/index -- is there any way I could get a copy of just the tii/tis
>> files in your index?  Your index is a great test for Lucene!

We haven't been able to make much data available due to copyright and other
legal issues.  However, since there is absolutely no way anyone could
reconstruct copyrighted works from the tii/tis index alone, that should be ok
on that front.  On Monday I'll try to get legal/administrative clearance to
provide the data and also ask around and see if I can get the ok to either
find a spare hard drive to ship, or make some kind of sftp arrangement.
Hopefully we will find a way to be able to do this.

BTW, most of the terms are probably the result of dirty OCR and the impact is
probably increased by our present "punctuation filter".  When we re-index we
plan to use a more intelligent filter that will truncate extremely long tokens
on punctuation, and we also plan to do some minimal prefiltering prior to
sending documents to Solr for indexing.  However, since we now have over 400
languages, we will have to be conservative in our filtering since we would
rather index dirty OCR than risk not indexing legitimate content.

Tom


mm=0?

2010-09-11 Thread Satish Kumar
Hi,

We have a requirement to show at least one result every time -- i.e., even
if the user-entered term is not found in any of the documents. I was hoping
that setting mm to 0 would return results in all cases, but it does not.

For example, if the user entered the term "alpha" and it is *not* in any of
the documents in the index, any document in the index can be returned. If the
term "alpha" is in the document set, only documents having the term "alpha"
must be returned.

My idea so far is to perform a search using the user-entered term. If there
are any results, return them. If there are no results, perform another search
without the query term -- this means doing two searches. Any suggestions on
implementing this requirement using only one search?


Thanks,
Satish
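
For reference: mm=0 doesn't help because a Lucene BooleanQuery made only of
optional clauses still returns just the documents matching at least one
clause. A rough sketch of the two-search fallback described above, using
SolrJ 1.4 (the URL and query strings are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class FallbackSearch {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    QueryResponse rsp = solr.query(new SolrQuery("alpha"));
    SolrDocumentList docs = rsp.getResults();
    if (docs.getNumFound() == 0) {
      // nothing matched: fall back to match-all so at least one doc is shown
      SolrQuery fallback = new SolrQuery("*:*");
      fallback.setRows(1);
      docs = solr.query(fallback).getResults();
    }
    System.out.println("hits: " + docs.getNumFound());
  }
}

With the standard (lucene) query parser, a single request such as
q=alpha OR *:* matches everything while ranking documents containing "alpha"
first; whether that mixed ranking is acceptable depends on the application.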


RE: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Burton-West, Tom
Thanks Mike,

>>Do you use a terms index divisor?  Setting that to 2 would halve the
>>amount of RAM required but double (on average) the seek time to locate
>>a given term (but, depending on your queries, that seek time may still
>>be a negligible part of overall query time, ie the tradeoff could be very 
>>worth it).

On Monday I plan to switch to Solr 1.4.1 on our test machine and experiment 
with the index divisor.  Is there an example of how to set up the divisor 
parameter in solrconfig.xml somewhere?
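
The closest thing I've found so far is the indexReaderFactory hook in the
Solr 1.4 example solrconfig.xml -- untested, and the parameter name is from
memory, so treat this as a sketch:

<indexReaderFactory name="IndexReaderFactory" class="solr.StandardIndexReaderFactory">
  <!-- load only every 2nd indexed term into RAM -->
  <int name="setTermIndexDivisor">2</int>
</indexReaderFactory>

Unlike termIndexInterval, the divisor only changes how an existing tii file
is loaded, so no reindexing should be needed.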

>>In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use large 
>>parallel arrays instead of separate objects, and, 
>>we hold much less in RAM.  Simply upgrading to 4.0 and re-indexing will show 
>>this gain...

I'm looking forward to a number of the developments in 4.0, but am a bit wary 
of using it in production.   I've wanted to work in some tests with 4.0, but 
other more pressing issues have so far prevented this.

What about Lucene 2205?  Would that be a way to get some of the benefit similar 
to the changes in flex without the rest of the changes in flex and 4.0?

>>I'd be really curious to test the RAM reduction in 4.0 on your terms  
>>dict/index -- 
>>is there any way I could get a copy of just the tii/tis  files in your index? 
>> Your index is a great test for Lucene!

We haven't been able to make much data available due to copyright and other 
legal issues.  However, since there is absolutely no way anyone could 
reconstruct copyrighted works from the tii/tis index alone, that should be ok 
on that front.  On Monday I'll try to get legal/administrative clearance to 
provide the data and also ask around and see if I can get the ok to either find 
a spare hard drive to ship, or make some kind of sftp arrangement.  Hopefully 
we will find a way to be able to do this.

BTW, most of the terms are probably the result of dirty OCR and the impact is
probably increased by our present "punctuation filter".  When we re-index we
plan to use a more intelligent filter that will truncate extremely long tokens
on punctuation, and we also plan to do some minimal prefiltering prior to
sending documents to Solr for indexing.  However, since we now have over 400
languages, we will have to be conservative in our filtering since we would
rather index dirty OCR than risk not indexing legitimate content.

Tom



RE: multivalued fields in result

2010-09-11 Thread Markus Jelsma
Yes, you'll get what is stored and asked for. 
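
For example, something like this in schema.xml (field names made up):

<field name="tags" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="tags_search" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="tags" dest="tags_search"/>

Request the stored field with fl=tags (plus whatever else you need) and every
value of the field comes back in each document; search against tags_search.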
 
-Original message-
From: Jason Chaffee 
Sent: Sat 11-09-2010 05:27
To: solr-user@lucene.apache.org; 
Subject: multivalued fields in result

Is it possible to return multivalued fields in the result?  

I would like to have a multivalued field that is stored and not indexed (I also 
copy the same field into another field where it is tokenized and indexed).  I 
would then like all the values of this field returned in the result set.  Is 
there a way to do this?

If it is not possible, could someone elaborate on why that is, so that I may 
see if I can make it work?

thanks,

Jason


Re: Autocomplete with Filter Query

2010-09-11 Thread Ingo Renner

On 10.09.2010 at 17:14, David Yang wrote:

Hi David,

> Is there any way to provide autocomplete while filtering results?

yes, you can use facets to achieve that.
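
For example, prefix faceting on an untokenized field, restricted by whatever
filter you need (the field and filter names here are placeholders):

q=*:*&rows=0&fq=category:books&facet=true&facet.field=title_exact&facet.prefix=harr&facet.limit=10

The returned facet values are the completions, and the fq limits them to the
filtered subset.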


best
Ingo

-- 
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code

TYPO3 - Open Source Enterprise Content Management System
http://typo3.org

Apache Solr for TYPO3 - Enterprise Search meets Enterprise Content Management
http://www.typo3-solr.com



Re: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Michael McCandless
Unfortunately, the terms index (before 4.0) is not RAM efficient -- I
wrote about this here:

http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html

Every indexed term that's loaded into RAM creates 4 objects (TermInfo,
Term, String, char[]), as you see in your profiler output.  And each
object has a number of fields, the header required by the JRE, GC
cost, etc.
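
A rough worked example from the jmap histogram quoted below: ~27.8M TermInfo
instances in 1,112,435,480 bytes is ~40 bytes each, Term is ~32, String ~40,
and the char[] ~52 on average -- roughly 164 bytes of heap per indexed term
held in RAM, before counting the big lookup arrays (the [J and
[Lorg.apache.lucene.index.* entries).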

Do you use a terms index divisor?  Setting that to 2 would halve the
amount of RAM required but double (on average) the seek time to locate
a given term (but, depending on your queries, that seek time may still
be a negligible part of overall query time, ie the tradeoff could be
very worth it).
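
At the Lucene level the divisor is just an argument when opening the reader;
a minimal sketch against the 3.x API (the path is a placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class OpenWithDivisor {
  public static void main(String[] args) throws Exception {
    // divisor = 2: load every 2nd indexed term, halving terms-index RAM
    IndexReader reader = IndexReader.open(
        FSDirectory.open(new File("/path/to/index")),
        null,  // deletion policy (null = keep default)
        true,  // read-only
        2);    // termInfosIndexDivisor
    System.out.println("maxDoc = " + reader.maxDoc());
    reader.close();
  }
}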

In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
large parallel arrays instead of separate objects, and, we hold much
less in RAM.  Simply upgrading to 4.0 and re-indexing will show this
gain; however, we have reduced the terms index interval from 128 to
32, so if you want a "fair" comparison you should set this back to 128
for your indexing (or, set a terms index divisor of 4 when opening
your readers).

Note that [C aren't UTF8 character arrays -- they are UTF16, meaning
they always consume 2 bytes per character.  But, in 4.0, they are in
fact UTF8 arrays, so, depending on your character distribution, this
can also be a win (or, in some cases, a loss, which is why we are
considering the more efficient BOCU1 encoding by default in
LUCENE-1799).

I'd be really curious to test the RAM reduction in 4.0 on your terms
dict/index -- is there any way I could get a copy of just the tii/tis
files in your index?  Your index is a great test for Lucene!

Mike

On Fri, Sep 10, 2010 at 6:46 PM, Burton-West, Tom  wrote:
> Hi all,
>
> When we run the first query after starting up Solr, memory use goes up from 
> about 1GB to 15GB and never goes below that level.  In debugging a recent OOM 
> problem I ran jmap with the output appended below.  Not surprisingly, given 
> the size of our indexes, it looks like the TermInfo and Term data structures 
> which are the in-memory representation of the tii file are taking up most of 
> the memory. This is running Solr under Tomcat with 16GB allocated to the jvm 
> and 3 shards each with a tii file of about 600MB.
>
> Total index size is about 400GB for each shard (we are indexing about 600,000 
> full-text books in each shard).
>
> In interpreting the jmap output, can we assume that the listings for utf8 
> character arrays ("[C"), java.lang.String, long arrays ("[J"), and int 
> arrays ("[I") are all part of the data structures involved in representing the 
> tii file in memory?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
> (jmap output, commas in numbers added)
>
> num     #instances         #bytes  class name
> --
>   1:      82,496,803     4,273,137,904  [C
>   2:      82,498,673     3,299,946,920  java.lang.String
>   3:      27,810,887     1,112,435,480  org.apache.lucene.index.TermInfo
>   4:      27,533,080     1,101,323,200  org.apache.lucene.index.TermInfo
>   5:      27,115,577     1,084,623,080  org.apache.lucene.index.TermInfo
>   6:      27,810,894      889,948,608  org.apache.lucene.index.Term
>   7:      27,533,088      881,058,816  org.apache.lucene.index.Term
>   8:      27,115,589      867,698,848  org.apache.lucene.index.Term
>   9:           148      659,685,520  [J
>  10:             2      222,487,072  [Lorg.apache.lucene.index.Term;
>  11:             2      222,487,072  [Lorg.apache.lucene.index.TermInfo;
>  12:             2      220,264,600  [Lorg.apache.lucene.index.Term;
>  13:             2      220,264,600  [Lorg.apache.lucene.index.TermInfo;
>  14:             2      216,924,560  [Lorg.apache.lucene.index.Term;
>  15:             2      216,924,560  [Lorg.apache.lucene.index.TermInfo;
>  16:        737,060      155,114,960  [I
>  17:        627,793       35,156,408  java.lang.ref.SoftReference
>
>
>
>
>