Re: Hierarchical Facet Field Prefix Not Working

2009-09-25 Thread Koji Sekiguchi

Hi Nasseam,

From a brief look at the patch, I think the per-field facet.prefix
parameter should work on hierarchical facet fields.
And I can get the same facet results with:




when using the sample data from the SOLR-64 thread.
It is likely I'm missing something.


Nasseam Elkarra wrote:

Hello all,

We are using the patch from SOLR-64 to implement 
hierarchical facets for categories. We are trying to use the 
facet.prefix to prevent all categories from coming back. However, 
f.category.facet.prefix doesn't work. Using facet.prefix works but 
prevents the other facets from coming back since it is a global 
option. Are per facet options supported on hierarchical facet fields? 
If not, how can I get a specific category and its children without 
getting the surrounding categories?

Any help is much appreciated.

Thank you,

Nasseam Elkarra
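
For readers following along, the difference between the global and the per-field form of the parameter can be sketched as below. This is only an illustration: the field name `category` comes from the question above, while the second facet field `brand`, the prefix value `Electronics/` and the `/select` path are assumptions, and nothing here reflects how the SOLR-64 patch itself handles the parameter.

```python
from urllib.parse import urlencode


def facet_query(per_field_prefix: bool) -> str:
    """Build a Solr query string with a facet prefix on 'category'."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", "category"),
        ("facet.field", "brand"),  # hypothetical second facet field
    ]
    if per_field_prefix:
        # Per-field form: only the 'category' facet is restricted.
        params.append(("f.category.facet.prefix", "Electronics/"))
    else:
        # Global form: the prefix applies to every facet field.
        params.append(("facet.prefix", "Electronics/"))
    return "/select?" + urlencode(params)


print(facet_query(per_field_prefix=True))
print(facet_query(per_field_prefix=False))
```

With the global form the hypothetical `brand` facet is filtered by the same prefix, which matches the symptom described in the question.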
Re: Mixed field types and boolean searching

2009-09-25 Thread Lance Norskog
The DisMax parser essentially creates a set of queries against
different fields. These queries are analyzed as per each field.

I think this is what you are talking about: "The" in a movie title is
different from "the" in the movie description. Would you expect "The
Sound Of Music" to fetch every movie in the database? So "the" is a
stopword in the description but not in the title.

Also, the DisMax parser has no OR. It has +, - and "at least one of,
and more is better". The query "A B" means "A or B, but both is
better". "+A +B" means "A AND B". "+A B" means "must have A, but is
better with B".

On Fri, Sep 25, 2009 at 7:04 AM, Ensdorf Ken  wrote:
>> No- there are various analyzers. StandardAnalyzer is geared toward
>> searching bodies of text for interesting words -  punctuation is
>> ripped out. Other analyzers are more useful for "concrete" text. You
>> may have to work at finding one that leaves punctuation in.
> My problem is not with the StandardAnalyzer per se, but more as to how 
> "dismax" style queries are handled by the query parser when the different 
> fields have different sets of ignored tokens or stop words.
> Say you want to use the contents of a text box in your app and query a field 
> in Solr.  The user enters "A and B", so you map this to "f1:A and f1:B".  
> Now, if "B" is an ignored token in the "f1" field for whatever reason, the 
> query boils down to "f1:A".
> Now imagine you want to allow the user's text to match multiple fields - as 
> in any term can match any field, but all terms must match at least 1 field.  
> So now you map the user's query to "(f1:A OR f2:A) AND (f1:B OR f2:B)".  But 
> if f2 does not ignore "B", the query boils down to "(f1:A OR f2:A) AND 
> (f2:B)".  Now documents that could come back when you were only matching 
> against the f1 field don't come back.
> This seems counter-intuitive - to be consistent, I would think the query 
> should essentially be treated as "(f1:A OR f2:A) AND (TRUE OR f2:B) " - and 
> thus a term that is a stop word or ignored token for any of the fields would 
> be ignored across the board.
> So I guess what I'm asking is if there is a reason for the existing behavior, 
> or is it just a fact-of-life of the query parser?  Thanks!
> -Ken
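
The behaviour Ken describes can be reproduced with a toy model. The sketch below is not Solr's query parser; it only assumes that each field's analyzer can be reduced to a plain stop-word set (the sets used here are hypothetical) and that a term dropped by a field produces no clause for that field.

```python
def build_query(terms, stopwords_by_field):
    """Return a boolean query in which every term must match some field.

    Each field's analyzer is modelled as a stop-word set; a term that a
    field treats as a stop word yields no clause for that field.
    """
    groups = []
    for term in terms:
        clauses = [
            f"{field}:{term}"
            for field, stopwords in stopwords_by_field.items()
            if term not in stopwords
        ]
        if clauses:  # a term dropped by every field vanishes entirely
            groups.append("(" + " OR ".join(clauses) + ")")
    return " AND ".join(groups)


# 'B' is ignored by f1 only.
print(build_query(["A", "B"], {"f1": {"B"}}))
print(build_query(["A", "B"], {"f1": {"B"}, "f2": set()}))
```

With only f1, the B clause disappears and A alone decides the match. Once f2 is added, B survives as a required clause on f2, so documents that matched the f1-only query can drop out.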

Problem changing the default MergePolicy/Scheduler

2009-09-25 Thread Jibo John


It looks like solr is not allowing me to change the default  
MergePolicy/Scheduler classes.

Even if I change the default MergePolicy/Scheduler
(LogByteSizeMergePolicy and ConcurrentMergeScheduler) defined
in solrconfig.xml to different ones (LogDocMergePolicy and
SerialMergeScheduler), my profiler shows the default classes are still
being loaded.

Also, if I use the default LogByteSizeMergePolicy, I can't seem to
set 'calibrateSizeByDeletes' to 'true' in solrconfig.xml
using the new syntax that was introduced this week (SOLR-1447).

I'm using the version checked out from trunk yesterday.

Any pointers will be helpful.


Re: Showcase: Facetted Search for Wine using Solr

2009-09-25 Thread Lance Norskog
Have you seen this? It is another Solr/Typo3 integration project.

Would you consider open-sourcing your Solr/Typo3 integration?

On Fri, Sep 25, 2009 at 1:18 AM, Marian Steinbach
> Hi Grant!
> Thanks for the advice, I added the link to the list.
> Regards,
> Marian
> On Fri, Sep 25, 2009 at 5:14 AM, Grant Ingersoll  wrote:
>> Hi Marian,
>> Looks great!  Wish I could order some wine.  When you get a chance, please
>> add the site to!
>> Cheers,
>> Grant
>> On Sep 24, 2009, at 11:51 AM, marian.steinbach wrote:
>>> Hello everybody!
>>> The purpose of this mail is to say "thank you" to the creators of Solr
>>> and to the community that supports it.
>>> We released our first project using Solr several weeks ago, after
>>> having tested Solr for several months.
>>> The project I'm talking about is a product search for an online wine
>>> shop (sorry, German user interface only):
>>> Our client offers about 3000 different wines and other related products.
>>> Before we introduced Solr, the products were searched via
>>> complicated and slow SQL statements, with all kinds of problems related
>>> to that. No full text indexing, no stemming etc.
>>> We are happy to make use of several built-in features which solve
>>> problems that bugged us: faceted search, German accents and stemming
>>> and synonyms being the most important ones.
>>> The surrounding website is TYPO3 driven. We integrated Solr by
>>> creating our own frontend plugin which talks to the Solr webservice
>>> (and we're very happy about the PHP output type!).
>>> I'd be glad about your comments.
>>> Cheers,
>>> Marian
>> --
>> Grant Ingersoll
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:

Re: Solr http post performance seems slow - help?

2009-09-25 Thread Lance Norskog
Your indexing project is disk-bound. My modern midrange laptop gets
30MB/s doing "cat > /dev/null" (1 7200rpm disk). The Amazon instances
I'm playing with get 50-60 (I really want to know how it fits
together). Your laptop might be 10-20?
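
The figures above are sequential-read rates. A rough way to get a comparable number without shell tools is sketched below; the function name and the temporary demo file are assumptions for illustration, and the operating system's page cache will inflate the result unless the file is larger than RAM or has not been read recently.

```python
import os
import tempfile
import time


def read_throughput_mb_s(path, chunk_size=1024 * 1024):
    """Read a file sequentially and return the observed rate in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small files.
    return (total / (1024 * 1024)) / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Demo on a small throwaway file; point this at a large data file
    # on the disk under test to get a meaningful number.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        print(f"{read_throughput_mb_s(tmp.name):.1f} MB/s")
    finally:
        os.unlink(tmp.name)
```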

On Thu, Sep 24, 2009 at 11:54 PM, Constantijn Visinescu
> This may or may not help but here goes :)
> When I was running performance tests I took a look at the simple post tool
> that comes with the solr examples.
> First I changed my schema.xml to fit my needs and then I deleted the old
> index so solr created a blank one when I started up.
> Then I had a process chew on my data and spit out xml files that are
> formatted similarly to the xml files that the SimplePostTool example uses.
> Next I used the simple post tool to post the xml files to solr (60k-80k
> records per xml file). Each file only took a couple minutes to index this
> way.
> Commit and optimize after that (took less than 10 minutes) and after about
> 2.5 hrs I had indexed just under 8 million records.
> This was on a 4 year old single core laptop using resin 3 as my servlet
> container.
> Hope this helps.
> On Fri, Sep 25, 2009 at 3:51 AM, Lance Norskog  wrote:
>> In "top", press the '1' key. This will give a list of the CPUs and how
>> much load is on each. The display is otherwise a little weird for
>> multi-cpu machines. But don't be surprised when Solr is I/O bound. The
>> biggest fanciest RAID is often a better investment than CPUs. On one
>> project we bought low-end rack servers that come with 6-8 disk bays,
>> and filled them with 10k/15k RPM disks.
>> On Wed, Sep 23, 2009 at 2:47 PM, Dan A. Dickey 
>> wrote:
>> > On Friday 11 September 2009 11:06:20 am Dan A. Dickey wrote:
>> > ...
>> >> Our JBoss expert and I will be looking into why this might be occurring.
>> >> Does anyone know of any JBoss related slowness with Solr?
>> >> And does anyone have any other sort of suggestions to speed indexing
>> >> performance?   Thanks for your help all!  I'll keep you up to date with
>> >> further progress.
>> >
>> > Ok, further progress... just to keep any interested parties up to date
>> > and for the record...
>> >
>> > I'm finding that using the "example" jetty setup (will be switching very
>> > very soon to a "real" jetty installation) is about the fastest.  Using
>> > several processes to send posts to Solr helps a lot, and we're seeing
>> > about 80 posts a second this way.
>> >
>> > We also stripped down JBoss to the bare bones and the Solr in it
>> > is running nearly as fast - about 50 posts a second.  It was our previous
>> > JBoss configuration that was making it appear "slow" for some reason.
>> >
>> > We will be running more tests and spreading out the "pre-index" workload
>> > across more machines and more processes. In our case we were seeing
>> > the bottleneck being one machine running 18 processes.
>> > The 2 quad core xeon system is experiencing about a 25% cpu load.
>> > And I'm not certain, but I think this may be actually 25% of one of the 8
>> > cores.
>> > So, there's *lots* of room for Solr to be doing more work there.
>> >        -Dan
>> >
>> > --
>> > Dan A. Dickey | Senior Software Engineer
>> --
>> Lance Norskog

Re: problem with HTMLStripStandardTokenizerFactory

2009-09-25 Thread Yonik Seeley
Can you give a small test file that demonstrates the problem?


On Fri, Sep 25, 2009 at 5:34 AM, Kundig, Andreas
> Hello
> I can't bring HTMLStripStandardTokenizerFactory to remove the content of the 
> style tag, as the documentation says it should.
> A search for 'mso' returns a document where the search term only appears in 
> the style tag (it's a word document saved as html). Here is the highlight 
> returned by solr (by the way: the wrong word is highlighted).
> "vetica;\n\tpanose-1:2 11 5 4 2 2 2 2 2 4;
> I am using solr 1.3. Here is how I configured the tokenizer in schema.xml
>     positionIncrementGap="100">
>         words="stopwords.txt" enablePositionIncrements="true"/>
>         generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
> splitOnCaseChange="1"/>
>         protected="protwords.txt"/>
>         ignoreCase="true" expand="true"/>
>         words="stopwords.txt"/>
>         generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
> splitOnCaseChange="1"/>
>         protected="protwords.txt"/>
> Am I doing something wrong?
> thank you
> Andréas Kündig

RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Sorry for the off-topic post:
Create a dummy "Hello, World!" JSP, use Tomcat, execute load-stress
simulator(s) from separate machine(s), and measure... don't forget to
allocate the necessary thread pools in Tomcat (if you have to)...
Although such a JSP doesn't use any memory, you will see how easily one can
reach 5000 TPS (or 'virtually' 5 concurrent users) on modern quad-cores
by simply allocating more memory (...GB) and more Tomcat threads. There is
a threshold too... repeat it with HTTPD workers (and threads), same result,
although it doesn't use any GC. More memory - more threads - more "keep
alives" per TCP...

However, 'theoretically' you need only 64Mb for "Hello World" :)))

Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I have around 8M documents.
I set up my server to use a different collector and the time spent in GC
seems to have decreased from 11% to 4%. Of course I need to wait a bit more
because the log is only 1 hour old, but it seems much better now.
I will tell you the results on Monday :)

On Fri, Sep 25, 2009 at 6:07 PM, Mark Miller  wrote:

> Thats a good point too - if you can reduce your need for such a large
> heap, by all means, do so.
> However, considering you already need at least 10GB or you get OOM, you
> have a long way to go with that approach. Good luck :)
> How many docs do you have ? I'm guessing its mostly FieldCache type
> stuff, and thats the type of thing you can't really side step, unless
> you give up the functionality thats using it.
> Grant Ingersoll wrote:
> >
> > On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
> >
> >> Hi to all!
> >> Lately my solr servers seem to stop responding once in a while. I'm
> >> using
> >> solr 1.3.
> >> Of course I'm having more traffic on the servers.
> >> So I logged the Garbage Collection activity to check if it's because of
> >> that. It seems like 11% of the time the application runs, it is stopped
> >> because of GC. And some times the GC takes up to 10 seconds!
> >> Is is normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
> >> servers. My index is around 10GB and I'm giving to the instances 10GB of
> >> RAM.
> >>
> >> How can I check which is the GC that it is being used? If I'm right JVM
> >> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do
> >> you have
> >> any recommendation on this?
> >
> >
> > As I said in Eteve's thread on JVM settings, some extra time spent on
> > application design/debugging will save a whole lot of headache in
> > Garbage Collection and trying to tune the gazillion different options
> > available.  Ask yourself:  What is on the heap and does it need to be
> > there?  For instance, do you, if you have them, really need sortable
> > ints?   If your servers seem to come to a stop, I'm going to bet you
> > have major collections going on.  Major collections in a production
> > system are very bad.  They tend to happen right after commits in
> > poorly tuned systems, but can also happen in other places if you let
> > things build up due to really large heaps and/or things like really
> > large cache settings.  I would pull up jConsole and have a look at
> > what is happening when the pauses occur.  Is it a major collection?
> > If so, then hook up a heap analyzer or a profiler and see what is on
> > the heap around those times.  Then have a look at your schema/config,
> > etc. and see if there are things that are memory intensive (sorting,
> > faceting, excessively large filter caches).
> >
> > --
> > Grant Ingersoll
> --
> - Mark

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
One more point and I'll stop - I've hit my email quota for the day ;)

While it's a pain to have to juggle GC params and tune - when you require
a heap that's more than a gig or two, I personally believe it's essential
to do so for good performance. The default settings (ergonomics with the
throughput collector) just don't cut it. Sad fact of life :) Luckily, you
don't generally have to do that much to get things nice - the number of
options is not that staggering, and you don't usually need to get into
most of them. Choosing the right collector and tweaking a setting or
two can often be enough.

The most important thing to do with a large heap and the throughput
collector is to turn on parallel tenured collection. I've said it before,
but it really is key. At least if you have more than a processor or two -
which, for your sake, I hope you do :)

- Mark
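
As a concrete illustration of the advice above, here is a small sketch that assembles a JVM command line for either collector. The message does not name the option; on Sun HotSpot JVMs of this era the flag that enables parallel collection of the tenured generation is `-XX:+UseParallelOldGC`, and treating that as the option meant here is an assumption. The 10g heap comes from the thread, while `start.jar` (the example Jetty launcher) and the helper name `jvm_args` are purely illustrative.

```python
def jvm_args(heap="10g", collector="throughput"):
    """Return an illustrative java command line for the chosen collector."""
    args = ["java", f"-Xms{heap}", f"-Xmx{heap}"]
    if collector == "throughput":
        # Parallel young-generation collection plus parallel tenured collection.
        args += ["-XX:+UseParallelGC", "-XX:+UseParallelOldGC"]
    elif collector == "cms":
        # The concurrent collector mentioned elsewhere in the thread.
        args += ["-XX:+UseConcMarkSweepGC"]
    else:
        raise ValueError(f"unknown collector: {collector}")
    # Pause-logging flags quoted elsewhere in the thread.
    args += [
        "-XX:+PrintGCApplicationConcurrentTime",
        "-XX:+PrintGCApplicationStoppedTime",
    ]
    return args + ["-jar", "start.jar"]  # assumed launcher, adjust to your container


print(" ".join(jvm_args()))
print(" ".join(jvm_args(collector="cms")))
```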

Mark Miller wrote:
> Thats a good point too - if you can reduce your need for such a large
> heap, by all means, do so.
> However, considering you already need at least 10GB or you get OOM, you
> have a long way to go with that approach. Good luck :)
> How many docs do you have ? I'm guessing its mostly FieldCache type
> stuff, and thats the type of thing you can't really side step, unless
> you give up the functionality thats using it.
> Grant Ingersoll wrote:
>> On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
>>> Hi to all!
>>> Lately my solr servers seem to stop responding once in a while. I'm
>>> using
>>> solr 1.3.
>>> Of course I'm having more traffic on the servers.
>>> So I logged the Garbage Collection activity to check if it's because of
>>> that. It seems like 11% of the time the application runs, it is stopped
>>> because of GC. And some times the GC takes up to 10 seconds!
>>> Is is normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
>>> servers. My index is around 10GB and I'm giving to the instances 10GB of
>>> RAM.
>>> How can I check which is the GC that it is being used? If I'm right JVM
>>> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do
>>> you have
>>> any recommendation on this?
>> As I said in Eteve's thread on JVM settings, some extra time spent on
>> application design/debugging will save a whole lot of headache in
>> Garbage Collection and trying to tune the gazillion different options
>> available.  Ask yourself:  What is on the heap and does it need to be
>> there?  For instance, do you, if you have them, really need sortable
>> ints?   If your servers seem to come to a stop, I'm going to bet you
>> have major collections going on.  Major collections in a production
>> system are very bad.  They tend to happen right after commits in
>> poorly tuned systems, but can also happen in other places if you let
>> things build up due to really large heaps and/or things like really
>> large cache settings.  I would pull up jConsole and have a look at
>> what is happening when the pauses occur.  Is it a major collection? 
>> If so, then hook up a heap analyzer or a profiler and see what is on
>> the heap around those times.  Then have a look at your schema/config,
>> etc. and see if there are things that are memory intensive (sorting,
>> faceting, excessively large filter caches).
>> --
>> Grant Ingersoll

- Mark

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
That's a good point too - if you can reduce your need for such a large
heap, by all means, do so.

However, considering you already need at least 10GB or you get OOM, you
have a long way to go with that approach. Good luck :)

How many docs do you have? I'm guessing it's mostly FieldCache type
stuff, and that's the type of thing you can't really sidestep, unless
you give up the functionality that's using it.

Grant Ingersoll wrote:
> On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
>> Hi to all!
>> Lately my solr servers seem to stop responding once in a while. I'm
>> using
>> solr 1.3.
>> Of course I'm having more traffic on the servers.
>> So I logged the Garbage Collection activity to check if it's because of
>> that. It seems like 11% of the time the application runs, it is stopped
>> because of GC. And some times the GC takes up to 10 seconds!
>> Is is normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
>> servers. My index is around 10GB and I'm giving to the instances 10GB of
>> RAM.
>> How can I check which is the GC that it is being used? If I'm right JVM
>> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do
>> you have
>> any recommendation on this?
> As I said in Eteve's thread on JVM settings, some extra time spent on
> application design/debugging will save a whole lot of headache in
> Garbage Collection and trying to tune the gazillion different options
> available.  Ask yourself:  What is on the heap and does it need to be
> there?  For instance, do you, if you have them, really need sortable
> ints?   If your servers seem to come to a stop, I'm going to bet you
> have major collections going on.  Major collections in a production
> system are very bad.  They tend to happen right after commits in
> poorly tuned systems, but can also happen in other places if you let
> things build up due to really large heaps and/or things like really
> large cache settings.  I would pull up jConsole and have a look at
> what is happening when the pauses occur.  Is it a major collection? 
> If so, then hook up a heap analyzer or a profiler and see what is on
> the heap around those times.  Then have a look at your schema/config,
> etc. and see if there are things that are memory intensive (sorting,
> faceting, excessively large filter caches).
> --
> Grant Ingersoll

- Mark

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Mark Miller wrote:
> Jonathan Ariel wrote:
>> How can I check which is the GC that it is being used? If I'm right JVM
>> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
>> any recommendation on this?
> Just to straighten out this one too - Ergonomics doesn't use throughput
> - throughput is the collector that allows Ergonomics ;)
> And throughput is the default as long as your machine is detected as
> server class.
> But throughput is not great with large tenured spaces out of the box. It
> only parallelizes the new space collection. You have to turn on an
> option to get parallel tenured collection as well - which is essential
> to scale to large heap sizes.
hmm - I'm not being totally accurate there - ergonomics is what detects
server and so makes throughput the default collector for a server
machine. But much of the GC ergonomics support only works with the
throughput collector. Kind of chicken and egg :)

- Mark

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Jonathan Ariel wrote:
> How can I check which is the GC that it is being used? If I'm right JVM
> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
> any recommendation on this?
Just to straighten out this one too - Ergonomics doesn't use throughput
- throughput is the collector that allows Ergonomics ;)

And throughput is the default as long as your machine is detected as
server class.

But throughput is not great with large tenured spaces out of the box. It
only parallelizes the new space collection. You have to turn on an
option to get parallel tenured collection as well - which is essential
to scale to large heap sizes.

- Mark

RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
> Usually, fragmentation is dealt with using a mark-compact collector (or
> IBM has used a mark-sweep-compact collector).
> Copying collectors are not only super efficient at collecting young
> spaces, but they are also great for fragmentation - when you copy
> everything to the new space, you can remove any fragmentation. At the
> cost of double the space requirements though.

So if memory size is optimized (application-specific!), no object
copying will ever happen, although it is server-load specific too
(application-usage specific; what do they do most frequently?)
- it's just statistics; you need to monitor the JVM and make a decision.

A few years ago I had a hard time explaining to a client that a byte array
should be Base64-encoded instead of just 123... rather than resorting to GC
tuning...

SOLR uses XML; try to upload a big XML file - each Element instance needs at
least 100 bytes... try to create an array of 20M Elements (the parser will do
that!)... so any GC tuning is application-usage specific too... RAM allocation
and GC tuning are "usage"-specific, not SOLR-specific...
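
To put the XML example in numbers, the arithmetic can be worked through directly. Both inputs are taken from the message above; the 100-byte figure is its stated lower bound, not a measurement.

```python
BYTES_PER_ELEMENT = 100      # lower bound quoted in the message
ELEMENT_COUNT = 20_000_000   # "20M Elements"

total_bytes = BYTES_PER_ELEMENT * ELEMENT_COUNT
print(f"{total_bytes / 1e9:.1f} GB ({total_bytes / 2**30:.2f} GiB)")
```

So a single large upload parsed into an in-memory tree would need on the order of 2 GB of heap before any indexing work begins.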

Re: Solr and Garbage Collection

2009-09-25 Thread Grant Ingersoll

On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:

> Hi to all!
> Lately my solr servers seem to stop responding once in a while. I'm using
> solr 1.3.
> Of course I'm having more traffic on the servers.
> So I logged the Garbage Collection activity to check if it's because of
> that. It seems like 11% of the time the application runs, it is stopped
> because of GC. And some times the GC takes up to 10 seconds!
> Is it normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
> servers. My index is around 10GB and I'm giving to the instances 10GB of
> RAM.
>
> How can I check which is the GC that it is being used? If I'm right JVM
> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
> any recommendation on this?

As I said in Eteve's thread on JVM settings, some extra time spent on  
application design/debugging will save a whole lot of headache in  
Garbage Collection and trying to tune the gazillion different options  
available.  Ask yourself:  What is on the heap and does it need to be  
there?  For instance, do you, if you have them, really need sortable  
ints?   If your servers seem to come to a stop, I'm going to bet you  
have major collections going on.  Major collections in a production  
system are very bad.  They tend to happen right after commits in  
poorly tuned systems, but can also happen in other places if you let  
things build up due to really large heaps and/or things like really  
large cache settings.  I would pull up jConsole and have a look at  
what is happening when the pauses occur.  Is it a major collection?   
If so, then hook up a heap analyzer or a profiler and see what is on  
the heap around those times.  Then have a look at your schema/config,  
etc. and see if there are things that are memory intensive (sorting,  
faceting, excessively large filter caches).

Grant Ingersoll


Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
>> or IBM has used a mark-sweep-compact collector

Never mind - Sun's is also sometimes referred to as mark-sweep-compact.
I've just seen it referred to as mark-compact before as well. In either
case though, without some sort of sweep phase, there is no reclamation
of memory :)

It's interesting though - in the days of the early JVMs Sun talked more
about compaction - but if you look at their recent info, they don't even
mention it, or give you params to mess with it. They just talk about
the mark and the sweep phase.

IBM is much more open about a compaction phase, and not only do they
give controls to tune it, they let you turn it off completely.

Not sure what Sun is doing with compaction these days - or if they just
work with fragmentation avoidance techniques instead - haven't seen any
info on it.

Mark Miller wrote:
> When we talk about Collectors, we are not just talking about
> "collecting" - whatever that means. There isn't really a "collecting"
> phase - the whole algorithm is garbage collecting - hence calling the
> different implementations "collectors".
> Usually, fragmentation is dealt with using a mark-compact collector (or
> IBM has used a mark-sweep-compact collector).
> Copying collectors are not only super efficient at collecting young
> spaces, but they are also great for fragmentation - when you copy
> everything to the new space, you can remove any fragmentation. At the
> cost of double the space requirements though.
> So mark-compact is a compromise. First you mark whats reachable, then
> everything thats marked is copied/compacted to the bottom of the heap.
> Its all part of a "collection" though.
> Jonathan Ariel wrote:
>> Maybe what's missing here is how did I get the 11%.I just ran solr with the
>> following JVM params: -XX:+PrintGCApplicationConcurrentTime
>> -XX:+PrintGCApplicationStoppedTime with that I can measure the amount of
>> time the application run between collection pauses and the length of the
>> collection pauses, respectively.
>> I think that in this case the 11% is just for memory collection and not
>> defragmentation... but I'm not 100% sure.
>> On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:
>>> But again, GC is not just "Garbage Collection" as many in this thread
>>> think... it is also "memory defragmentation" which is much costly than
>>> "collection" just because it needs move somewhere _live_objects_ (and
>>> wait/lock till such objects get unlocked to be moved...) - obviously more
>>> memory helps...
>>> 11% is extremely high.
>>> -Fuad
>>>> -Original Message-
>>>> From: Jonathan Ariel []
>>>> Sent: September-25-09 3:36 PM
>>>> Subject: Re: FW: Solr and Garbage Collection
>>>>
>>>> I'm not planning on lowering the heap. I just want to lower the time
>>>> "wasted" on GC, which is 11% right now. So what I'll try is changing the
>>>> GC to -XX:+UseConcMarkSweepGC
>>>>
>>>> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
>>>>> Mark,
>>>>> what if piece of code needs 10 contiguous Kb to load a document field?
>>>>> How locked memory pieces are optimized/moved (putting on hold almost
>>>>> whole application)?
>>>>> Lowering heap is _bad_ idea; we will have extremely frequent GC
>>>>> (optimize of live objects!!!) even if RAM is (theoretically) enough.
>>>>> -Fuad
>>>>>> Fuad, you didn't read the thread right.
>>>>>> He is not having a problem with OOM. He got the OOM because he lowered
>>>>>> the heap to try and help GC.
>>>>>> He normally runs with a heap that can handle his FC.
>>>>>> Please re-read the thread. You are confusing the thread.
>>>>>> - Mark
>>>>>>> GC will frequently happen even if RAM is more than enough: in case if
>>>>>>> it is heavily sparse... so that have even more RAM!
>>>>>>> -Fuad
- Mark
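
The 11% figure discussed in this thread comes from the two logging flags Jonathan quotes. A minimal sketch of how such a percentage can be derived from the resulting log is given below. It assumes the usual HotSpot line formats for those flags ("Application time: ... seconds" and "Total time for which application threads were stopped: ... seconds"); the sample lines are invented for illustration and are not from Jonathan's log.

```python
import re

APP_RE = re.compile(r"Application time: ([\d.]+) seconds")
STOP_RE = re.compile(
    r"Total time for which application threads were stopped: ([\d.]+) seconds"
)


def stopped_fraction(log_lines):
    """Return stopped_time / (running_time + stopped_time) for a GC log."""
    running = stopped = 0.0
    for line in log_lines:
        m = APP_RE.search(line)
        if m:
            running += float(m.group(1))
            continue
        m = STOP_RE.search(line)
        if m:
            stopped += float(m.group(1))
    total = running + stopped
    return stopped / total if total else 0.0


# Invented sample: 8.9 s of application time, 1.1 s of pauses.
sample = [
    "Application time: 8.9000000 seconds",
    "Total time for which application threads were stopped: 1.1000000 seconds",
]
print(f"{stopped_fraction(sample):.0%}")
```

Note that the stopped time covers every stop-the-world pause, so it includes any compaction work done during a collection and cannot separate "collection" from "defragmentation".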

Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Ok. I'll first change the GC and see if the time spent decreases. Then
I'll try increasing the heap as Fuad recommends.

On 9/25/09, Mark Miller  wrote:
> When we talk about Collectors, we are not just talking about
> "collecting" - whatever that means. There isn't really a "collecting"
> phase - the whole algorithm is garbage collecting - hence calling the
> different implementations "collectors".
> Usually, fragmentation is dealt with using a mark-compact collector (or
> IBM has used a mark-sweep-compact collector).
> Copying collectors are not only super efficient at collecting young
> spaces, but they are also great for fragmentation - when you copy
> everything to the new space, you can remove any fragmentation. At the
> cost of double the space requirements though.
> So mark-compact is a compromise. First you mark whats reachable, then
> everything thats marked is copied/compacted to the bottom of the heap.
> Its all part of a "collection" though.
> Jonathan Ariel wrote:
>> Maybe what's missing here is how did I get the 11%.I just ran solr with
>> the
>> following JVM params: -XX:+PrintGCApplicationConcurrentTime
>> -XX:+PrintGCApplicationStoppedTime with that I can measure the amount of
>> time the application run between collection pauses and the length of the
>> collection pauses, respectively.
>> I think that in this case the 11% is just for memory collection and not
>> defragmentation... but I'm not 100% sure.
>> On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:
>>> But again, GC is not just "Garbage Collection" as many in this thread
>>> think... it is also "memory defragmentation" which is much costly than
>>> "collection" just because it needs move somewhere _live_objects_ (and
>>> wait/lock till such objects get unlocked to be moved...) - obviously more
>>> memory helps...
>>> 11% is extremely high.
>>> -Fuad
>>>> -Original Message-
>>>> From: Jonathan Ariel []
>>>> Sent: September-25-09 3:36 PM
>>>> Subject: Re: FW: Solr and Garbage Collection
>>>>
>>>> I'm not planning on lowering the heap. I just want to lower the time
>>>> "wasted" on GC, which is 11% right now. So what I'll try is changing the
>>>> GC to -XX:+UseConcMarkSweepGC
>>>>
>>>> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
>>>>> Mark,
>>>>> what if piece of code needs 10 contiguous Kb to load a document field?
>>>>> How locked memory pieces are optimized/moved (putting on hold almost
>>>>> whole application)?
>>>>> Lowering heap is _bad_ idea; we will have extremely frequent GC
>>>>> (optimize of live objects!!!) even if RAM is (theoretically) enough.
>>>>> -Fuad
>>>>>> Fuad, you didn't read the thread right.
>>>>>> He is not having a problem with OOM. He got the OOM because he lowered
>>>>>> the heap to try and help GC.
>>>>>> He normally runs with a heap that can handle his FC.
>>>>>> Please re-read the thread. You are confusing the thread.
>>>>>> - Mark
>>>>>>> GC will frequently happen even if RAM is more than enough: in case if
>>>>>>> it is heavily sparse... so that have even more RAM!
>>>>>>> -Fuad
> --
> - Mark

solr home

2009-09-25 Thread Park, Michael
I already have a handful of solr instances running. However, I'm
trying to install solr (1.4) on a new linux server with tomcat using a
context file (same way I usually do):




However it throws an exception due to the following:

SEVERE: Could not start SOLR. Check solr/home property

java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in
classpath or 'solr/conf/', cwd=/opt/local/solr/fedora_solr




Any ideas why this is happening?


Thanks, Mike

Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
When we talk about Collectors, we are not just talking about
"collecting" - whatever that means. There isn't really a "collecting"
phase - the whole algorithm is garbage collecting - hence calling the
different implementations "collectors".

Usually, fragmentation is dealt with using a mark-compact collector (or
IBM has used a mark-sweep-compact collector).
Copying collectors are not only super efficient at collecting young
spaces, but they are also great for fragmentation - when you copy
everything to the new space, you can remove any fragmentation. At the
cost of double the space requirements though.

So mark-compact is a compromise. First you mark what's reachable, then
everything that's marked is copied/compacted to the bottom of the heap.
It's all part of a "collection" though.
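
The two steps described above can be sketched as a toy mark-compact pass in
Python. This is only an illustration under simple assumptions: the "heap" is a
list of slots, references are slot indexes, and the names mark() and compact()
are made up here. A real JVM collector is far more involved.

```python
# Toy mark-compact collector. The "heap" is a list of slots; each slot is
# either None (free) or a dict {"name": str, "refs": [slot indexes]}.
# All names are illustrative; this is not how any real JVM is implemented.

def mark(heap, roots):
    """Return the set of slot indexes reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        i = stack.pop()
        if i in marked or heap[i] is None:
            continue
        marked.add(i)
        stack.extend(heap[i]["refs"])
    return marked

def compact(heap, roots):
    """Slide every marked object to the bottom of the heap and fix references."""
    marked = mark(heap, roots)
    # Forwarding table: old slot index -> new slot index, preserving order.
    forward = {old: new for new, old in enumerate(sorted(marked))}
    new_heap = [None] * len(heap)
    for old, new in forward.items():
        obj = heap[old]
        new_heap[new] = {
            "name": obj["name"],
            # References to free slots are dropped; live ones are rewritten.
            "refs": [forward[r] for r in obj["refs"] if r in forward],
        }
    new_roots = [forward[r] for r in roots if r in forward]
    return new_heap, new_roots

heap = [
    {"name": "a", "refs": [3]},      # live: it is the root
    {"name": "dead1", "refs": []},   # unreachable
    None,                            # hole left by an earlier collection
    {"name": "b", "refs": [5]},      # live: referenced by a
    {"name": "dead2", "refs": [1]},  # unreachable, even though it has a reference
    {"name": "c", "refs": []},       # live: referenced by b
]
heap, roots = compact(heap, [0])
names = [slot["name"] if slot else None for slot in heap]
print(names)
```

After the pass the live objects sit contiguously at the bottom of the list and
all free space is one block at the top, which is the point of compaction: a
later allocation that needs contiguous space no longer depends on the holes
lining up.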

Jonathan Ariel wrote:
> Maybe what's missing here is how did I get the 11%.I just ran solr with the
> following JVM params: -XX:+PrintGCApplicationConcurrentTime
> -XX:+PrintGCApplicationStoppedTime with that I can measure the amount of
> time the application run between collection pauses and the length of the
> collection pauses, respectively.
> I think that in this case the 11% is just for memory collection and not
> defragmentation... but I'm not 100% sure.
> On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:
>> But again, GC is not just "Garbage Collection" as many in this thread
>> think... it is also "memory defragmentation" which is much costly than
>> "collection" just because it needs move somewhere _live_objects_ (and
>> wait/lock till such objects get unlocked to be moved...) - obviously more
>> memory helps...
>> 11% is extremely high.
>> -Fuad
>>> -Original Message-
>>> From: Jonathan Ariel []
>>> Sent: September-25-09 3:36 PM
>>> To:
>>> Subject: Re: FW: Solr and Garbage Collection
>>> I'm not planning on lowering the heap. I just want to lower the time
>>> "wasted" on GC, which is 11% right now.So what I'll try is changing the
>> GC
>>> to -XX:+UseConcMarkSweepGC
>>> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:

>>>> Mark,
>>>> what if piece of code needs 10 contiguous Kb to load a document field? How
>>>> locked memory pieces are optimized/moved (putting on hold almost whole
>>>> application)?
>>>> Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize
>>>> of live objects!!!) even if RAM is (theoretically) enough.
>>>> -Fuad


> Faud, you didn't read the thread right.
> He is not having a problem with OOM. He got the OOM because he
>> lowered
> the heap to try and help GC.
> He normally runs with a heap that can handle his FC.
> Please re-read the thread. You are confusing the tread.
> - Mark
>> GC will frequently happen even if RAM is more than enough: in case
>> if
>> it
>> heavily sparse... so that have even more RAM!
>> -Fuad


- Mark

Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-25 Thread Silent Surfer
Hi Michael,

We are storing all our data in addition to indexing it, as we need to display
those values to the user. So unfortunately we cannot go with the option
stored=false, which could have potentially solved our issue.

Appreciate any other pointers/suggestions


--- On Fri, 9/25/09, Michael  wrote:

> From: Michael 
> Subject: Re: Can we point a Solr server to index directory dynamically at  
> runtime..
> To:
> Date: Friday, September 25, 2009, 2:00 PM
> Are you storing (in addition to
> indexing) your data?  Perhaps you could turn
> off storage on data older than 7 days (requires
> reindexing), thus losing the
> ability to return snippets but cutting down on your storage
> space and server
> count.  I've experienced 10x decrease in space
> requirements and a large
> boost in speed after cutting extraneous storage from Solr
> -- the stored data
> is mixed in with the index data and so it slows down
> searches.
> You could also put all 200G onto one Solr instance rather
> than 10 for >7days
> data, and accept that those searches will be slower.
> Michael
> On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer 
> wrote:
> > Hi,
> >
> > Thank you Michael and Chris for the response.
> >
> > Today after the mail from Michael, we tested with the
> dynamic loading of
> > cores and it worked well. So we need to go with the
> hybrid approach of
> > Multicore and Distributed searching.
> >
> > As per our testing, we found that a Solr instance with
> 20 GB of
> > index(single index or spread across multiple cores)
> can provide better
> > performance when compared to having a Solr instance
> say 40 (or) 50 GB of
> > index (single index or index spread across cores).
> >
> > So the 200 GB of index on day 1 will be spread across
> 200/20=10 Solr slave
> > instances.
> >
> > On day 2 data, 10 more Solr slave servers are
> required; Cumulative Solr
> > Slave instances = 200*2/20=20
> > ...
> > ..
> > ..
> > On day 30 data, 10 more Solr slave servers are
> required; Cumulative Solr
> > Slave instances = 200*30/20=300
> >
> > So with the above approach, we may need ~300 Solr
> slave instances, which
> > becomes very unmanageable.
> >
> > But we know that most of the queries are for the past 1
> week, i.e we
> > definitely need 70 Solr Slaves containing the last 7
> days worth of data up
> > and running.
> >
> > Now for the rest of the 230 Solr instances, do we need
> to keep it running
> > for the odd query,that can span across the 30 days of
> data (30*200 GB=6 TB
> > data) which can come up only a couple of times a day.
> > This linear increase of Solr servers with the
> retention period doesn't
> > seem to be a very scalable solution.
> >
> > So we are looking for a simpler approach
> to handle this
> > scenario.
> >
> > Appreciate any further inputs/suggestions.
> >
> > Regards,
> > sS
> >
> > --- On Fri, 9/25/09, Chris Hostetter 
> wrote:
> >
> > > From: Chris Hostetter 
> > > Subject: Re: Can we point a Solr server to index
> directory dynamically
> > at  runtime..
> > > To:
> > > Date: Friday, September 25, 2009, 4:04 AM
> > > : Using a multicore approach, you
> > > could send a "create a core named
> > > : 'core3weeksold' pointing to
> '/datadirs/3weeksold' "
> > > command to a live Solr,
> > > : which would spin it up on the fly.  Then
> you query
> > > it, and maybe keep it
> > > : spun up until it's not queried for 60 seconds
> or
> > > something, then send a
> > > : "remove core 'core3weeksold' " command.
> > > : See
> > > .
> > >
> > > something that seems implicit in the question is
> what to do
> > > when the
> > > request spans all of the data ... this is where
> (in theory)
> > > distributed
> > > searching could help you out.
> > >
> > > index each day's worth of data into its own core,
> that
> > > makes it really
> > > easy to expire the old data (just UNLOAD and
> delete an
> > > entire core once
> > > it's more than 30 days old) if your user is only
> searching
> > > "current" data
> > > then your app can directly query the core
> containing the
> > > most current data
> > > -- but if they want to query the last week, or
> last two
> > > weeks worth of
> > > data, you do a distributed request for all of the
> shards
> > > needed to search
> > > the appropriate amount of data.
> > >
> > > Between the ALIAS and SWAP commands on the
> CoreAdmin
> > > screen it should
> > > be pretty easy to have cores with names like
> > > "today","1dayold","2dayold" so
> > > that your app can configure simple shard params
> for all the
> > > permutations
> > > you'll need to query.
> > >
> > >
> > > -Hoss
> > >
> > >
> >
> >
> >
> >
> >
> >
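
The per-day core layout suggested in the quoted message can be sketched as
follows: given how many days a query spans, build the "shards" parameter for a
distributed request. The host, port, base path and the helper names are
assumptions for illustration. The core names follow the "today", "1dayold",
"2dayold" scheme from the thread, and the comma-separated host:port/path list
is the format Solr's distributed search expects for "shards".

```python
def core_name(age_days):
    """Core naming scheme from the thread: 'today', '1dayold', '2dayold', ..."""
    return "today" if age_days == 0 else f"{age_days}dayold"

def shards_param(days, host="localhost:8983", base="solr"):
    """Comma-separated shard list covering the most recent `days` days.

    `host` and `base` are placeholders; each entry could equally point at a
    different server.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return ",".join(f"{host}/{base}/{core_name(d)}" for d in range(days))

last_three_days = shards_param(3)
print(last_three_days)
```

A query over only the current day would go straight to the "today" core with
no "shards" parameter at all, so the common case pays no distributed-search
overhead.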

Re: FW: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Maybe what's missing here is how I got the 11%. I just ran Solr with the
following JVM params: -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime. With those I can measure the amount of
time the application runs between collection pauses and the length of the
collection pauses, respectively.
I think that in this case the 11% is just for memory collection and not
defragmentation... but I'm not 100% sure.
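
For anyone who wants to reproduce a figure like this, here is a minimal sketch
of the calculation in Python. It assumes the two flags produce lines of the
form "Application time: N seconds" and "Total time for which application
threads were stopped: N seconds"; that matches what HotSpot JVMs of this era
typically print, but check your own log first. The function name and sample
lines are made up.

```python
import re

# Assumed line formats for the two -XX:+PrintGCApplication* flags.
APP_RE = re.compile(r"Application time: ([0-9.]+) seconds")
STOP_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def pause_fraction(lines):
    """Fraction of wall-clock time during which application threads were stopped."""
    running = stopped = 0.0
    for line in lines:
        m = APP_RE.search(line)
        if m:
            running += float(m.group(1))
            continue
        m = STOP_RE.search(line)
        if m:
            stopped += float(m.group(1))
    total = running + stopped
    return stopped / total if total else 0.0

# Made-up sample: 8.9 s of running time and 1.1 s of stopped time.
sample = [
    "Application time: 8.9000000 seconds",
    "Total time for which application threads were stopped: 1.1000000 seconds",
]
fraction = pause_fraction(sample)
print(f"{fraction:.1%} of the time was spent in pauses")
```

Note that the "stopped" lines report whole stop-the-world pauses, so any
compaction done inside a pause is already included in the total rather than
reported separately.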

On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:

> But again, GC is not just "Garbage Collection" as many in this thread
> think... it is also "memory defragmentation" which is much costly than
> "collection" just because it needs move somewhere _live_objects_ (and
> wait/lock till such objects get unlocked to be moved...) - obviously more
> memory helps...
> 11% is extremely high.
> -Fuad
> > -Original Message-
> > From: Jonathan Ariel []
> > Sent: September-25-09 3:36 PM
> > To:
> > Subject: Re: FW: Solr and Garbage Collection
> >
> > I'm not planning on lowering the heap. I just want to lower the time
> > "wasted" on GC, which is 11% right now.So what I'll try is changing the
> GC
> > to -XX:+UseConcMarkSweepGC
> >
> > On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
> >
> > > Mark,
> > >
> > > what if piece of code needs 10 contiguous Kb to load a document field?
> How
> > > locked memory pieces are optimized/moved (putting on hold almost whole
> > > application)?
> > > Lowering heap is _bad_ idea; we will have extremely frequent GC
> (optimize
> > > of
> > > live objects!!!) even if RAM is (theoretically) enough.
> > >
> > > -Fuad
> > >
> > >
> > > >Faud, you didn't read the thread right.
> > > >
> > > > He is not having a problem with OOM. He got the OOM because he
> lowered
> > > > the heap to try and help GC.
> > > >
> > > > He normally runs with a heap that can handle his FC.
> > > >
> > > > Please re-read the thread. You are confusing the tread.
> > > >
> > > > - Mark
> > > >
> > >
> > >
> > > >> GC will frequently happen even if RAM is more than enough: in case
> if
> it
> > > is
> > > >> heavily sparse... so that have even more RAM!
> > > >> -Fuad
> > >
> > >
> > >

Re: FW: Solr and Garbage Collection

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 2:52 PM, Fuad Efendi  wrote:
> Lowering heap helps GC?

Yes.  In general, lowering the heap can help or hurt.

Hurt: if one is running very low on memory, GC will be working harder
all of the time trying to find more memory and the % of time that GC
takes can go up.

Help: if one has massive heaps, full GCs may not happen as frequently,
but when they do they can be larger and cause more of a problem.  For
many apps, a .2 second pause every minute is preferable to a 10 second
pause every hour.

And of course the other reason to lower the heap size *if* you don't
need it that big is to leave more memory for other stuff, and for the
OS itself to cache the index files.
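
To put numbers on the pause example above, a rough sketch (the function name
and the "small heap" / "large heap" labels are only illustrative):

```python
def gc_profile(pause_seconds, interval_seconds):
    """Return (fraction of time paused, worst single pause in seconds)."""
    return pause_seconds / interval_seconds, pause_seconds

small_heap = gc_profile(0.2, 60)      # a 0.2 second pause every minute
large_heap = gc_profile(10.0, 3600)   # a 10 second pause every hour

for label, (fraction, worst) in (("small heap", small_heap),
                                 ("large heap", large_heap)):
    print(f"{label}: {fraction:.2%} of time paused, worst pause {worst} s")
```

Both profiles spend roughly 0.3% of the time paused. What differs is the worst
single pause a request can run into, which is why the shorter, more frequent
pauses are often preferable for a search server.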


RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
But again, GC is not just "Garbage Collection" as many in this thread
think... it is also "memory defragmentation", which is much more costly than
"collection" just because it needs to move _live_objects_ somewhere (and
wait/lock till such objects get unlocked to be moved...) - obviously more
memory helps...

11% is extremely high.


> -Original Message-
> From: Jonathan Ariel []
> Sent: September-25-09 3:36 PM
> To:
> Subject: Re: FW: Solr and Garbage Collection
> I'm not planning on lowering the heap. I just want to lower the time
> "wasted" on GC, which is 11% right now.So what I'll try is changing the GC
> to -XX:+UseConcMarkSweepGC
> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
> > Mark,
> >
> > what if piece of code needs 10 contiguous Kb to load a document field? How
> > locked memory pieces are optimized/moved (putting on hold almost whole
> > application)?
> > Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize
> > of
> > live objects!!!) even if RAM is (theoretically) enough.
> >
> > -Fuad
> >
> >
> > >Faud, you didn't read the thread right.
> > >
> > > He is not having a problem with OOM. He got the OOM because he lowered
> > > the heap to try and help GC.
> > >
> > > He normally runs with a heap that can handle his FC.
> > >
> > > Please re-read the thread. You are confusing the tread.
> > >
> > > - Mark
> > >
> >
> >
> > >> GC will frequently happen even if RAM is more than enough: in case if it is
> > >> heavily sparse... so that have even more RAM!
> > >> -Fuad
> >
> >
> >

Re: shards and facet_count

2009-09-25 Thread Paul Rosen
Sorry for the long delay in responding, but I've just gotten back to 
this problem...

I got the Solr 1.4 nightly and the problem went away, so I guess it is a
Solr 1.3 bug.

Thanks for all the input!

Lance Norskog wrote:

Paul, can you create an HTTP url that does this exact query? With
multiple shards and facet requests?  And that does what you expect?
That would help the Ruby Dudes to figure out the discrepancy.


On Fri, Sep 18, 2009 at 7:01 PM, Yonik Seeley

On Fri, Sep 18, 2009 at 5:58 AM, Erik Hatcher  wrote:

It is strange that you get facet=false calls in there, but maybe this is
just normal distributed search protocol in one of the phases?

Right, on the second phase of a distrib request, additional faceting
may not be needed.

But it looks like the distributed request is being directed at two
different handlers rather than two different servers or cores?

I've never tried this, but from the log file, it doesn't look like the
sub-requests are going to those different handlers since the path is
always path=/select


Re: FW: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I'm not planning on lowering the heap. I just want to lower the time
"wasted" on GC, which is 11% right now. So what I'll try is changing the GC
to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:

> Mark,
> what if piece of code needs 10 contiguous Kb to load a document field? How
> locked memory pieces are optimized/moved (putting on hold almost whole
> application)?
> Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize
> of
> live objects!!!) even if RAM is (theoretically) enough.
> -Fuad
> >Faud, you didn't read the thread right.
> >
> > He is not having a problem with OOM. He got the OOM because he lowered
> > the heap to try and help GC.
> >
> > He normally runs with a heap that can handle his FC.
> >
> > Please re-read the thread. You are confusing the tread.
> >
> > - Mark
> >
> >> GC will frequently happen even if RAM is more than enough: in case if it
> is
> >> heavily sparse... so that have even more RAM!
> >> -Fuad

Hierarchical Facet Field Prefix Not Working

2009-09-25 Thread Nasseam Elkarra

Hello all,

We are using the patch from SOLR-64 to implement hierarchical facets for
categories. We are trying to use facet.prefix to prevent all categories
from coming back. However, f.category.facet.prefix doesn't work. Using
facet.prefix works but prevents the other facets from coming back since it
is a global option. Are per-facet options supported on hierarchical facet
fields? If not, how can I get a specific category and its children
without getting the surrounding categories?

Any help is much appreciated.

Thank you,

Nasseam Elkarra

RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi

What if a piece of code needs 10 contiguous KB to load a document field? How
are locked memory pieces optimized/moved (putting almost the whole application
on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC
(optimization of live objects!!!) even if RAM is (theoretically) enough.


>Faud, you didn't read the thread right.
> He is not having a problem with OOM. He got the OOM because he lowered
> the heap to try and help GC.
> He normally runs with a heap that can handle his FC.
> Please re-read the thread. You are confusing the tread.
> - Mark

>> GC will frequently happen even if RAM is more than enough: in case if it is
>> heavily sparse... so that have even more RAM!
>> -Fuad

RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
> He is not having a problem with OOM. He got the OOM because he lowered
> the heap to try and help GC.

That is very confusing!!!

Lowering heap helps GC? Someone mentioned it in this thread, but my
viewpoint is completely opposite.

1. Some RAM needs to be reserved for FieldCache (it will be populated
over time, a kind of not-well-documented "memory leak").
2. Some RAM is needed for the rest of application.

And, some pieces of code frequently need contiguous memory (100 bytes? 1000
bytes?), so that GC-optimize will run even if memory is more than
(theoretically) enough.

So that... have even more memory.

I had similar problems with 8GB; I don't have any problems with 16GB. And I
never waste time on GC optimization; the server ran without OOM or any
performance issues for almost a year.

>> SEVERE: java.lang.OutOfMemoryError: Java heap space

Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Fuad, you didn't read the thread right.

He is not having a problem with OOM. He got the OOM because he lowered
the heap to try and help GC.

He normally runs with a heap that can handle his FieldCache.

Please re-read the thread. You are confusing the thread.

- Mark

Fuad Efendi wrote:
> Guys, thanks for GC discussion; but the root of a problem is FieldCache
> internals.
> Not enough RAM for FieldCache will cause unpredictable OOM, and it does not
> depend on GC. How much RAM FieldCache needs in case of 2 different
> values for a Field, 200 bytes each (Unicode), and 100M documents? What if we
> have 100 such non-tokenized fields in a schema?
> SOLR has an option to warm up caches on startup which might help
> troubleshooting.
> JRockit JVM has 'realtime' version if you are interested in predictable GC
> (without delaying 'transaction')...
> GC will frequently happen even if RAM is more than enough: in case if it is
> heavily sparse... so that have even more RAM!
> -Original Message-
> From: Fuad Efendi [] 
> Sent: September-25-09 12:17 PM
> To:
> Subject: RE: Solr and Garbage Collection
>> You are saying that I should give more memory than 12GB?
> Yes. Look at this:
>>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>> 61
>>> )
> It can't find few (!!!) contiguous bytes for .createValue(...)
> It can't add (Field Value, Document ID) pair to an array.
> GC tuning won't help in this specific case...
> May be SOLR/Lucene core developers may WARM FieldCache at IndexReader
> opening time, in the future... to have early OOM...
> Avoiding faceting (and sorting) on such field will only postpone OOM to
> unpredictable date/time...
> -Fuad

- Mark

FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Guys, thanks for the GC discussion; but the root of the problem is FieldCache
internals.

Not enough RAM for FieldCache will cause unpredictable OOM, and it does not
depend on GC. How much RAM FieldCache needs in case of 2 different
values for a Field, 200 bytes each (Unicode), and 100M documents? What if we
have 100 such non-tokenized fields in a schema?
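
A back-of-the-envelope way to approach that question, as a Python sketch. It
assumes a string field is cached as one 32-bit int per document plus one copy
of each distinct value, which is roughly how Lucene's string FieldCache of this
era is laid out, and it ignores per-object overhead. The function name is made
up, and the distinct-value count used below is purely a placeholder, not the
figure from this message.

```python
def fieldcache_bytes(num_docs, distinct_values, bytes_per_value, fields=1):
    """Rough lower bound on FieldCache memory for non-tokenized string fields.

    Assumed model: one 4-byte ordinal per document, plus the distinct values
    themselves. Real usage will be higher because of object overhead.
    """
    ords = 4 * num_docs
    values = distinct_values * bytes_per_value
    return fields * (ords + values)

# Placeholder inputs: 100M documents, 200-byte values, and a hypothetical
# 1,000,000 distinct values per field.
one_field = fieldcache_bytes(100_000_000, 1_000_000, 200)
hundred_fields = fieldcache_bytes(100_000_000, 1_000_000, 200, fields=100)

print(f"one field:  {one_field / 1e9:.1f} GB")
print(f"100 fields: {hundred_fields / 1e9:.1f} GB")
```

Under this model the per-document ordinals alone cost 400 MB per field at 100M
documents, so the number of cached fields matters at least as much as the
number of distinct values.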

SOLR has an option to warm up caches on startup which might help
troubleshooting.

JRockit JVM has 'realtime' version if you are interested in predictable GC
(without delaying 'transaction')...

GC will frequently happen even if RAM is more than enough: in case if it is
heavily sparse... so that have even more RAM!

-Original Message-
From: Fuad Efendi [] 
Sent: September-25-09 12:17 PM
Subject: RE: Solr and Garbage Collection

> You are saying that I should give more memory than 12GB?

Yes. Look at this:

> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> 61
> > )

It can't find a few (!!!) contiguous bytes for .createValue(...)

It can't add (Field Value, Document ID) pair to an array.

GC tuning won't help in this specific case...

Maybe SOLR/Lucene core developers could warm FieldCache at IndexReader
opening time, in the future... to have early OOM...

Avoiding faceting (and sorting) on such field will only postpone OOM to
unpredictable date/time...


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
This all applies to having more than one processor though - if you have
one processor, then non-concurrent can also make sense.

But especially with the young space, you want concurrency - with up to
98% of objects being short-lived, and multiple threads generally
creating new objects, it's a huge boon to collect the young space

Mark Miller wrote:
> Walter Underwood wrote:
>> For batch-oriented computing, like Hadoop, the most efficient GC is probably
>> a non-concurrent, non-generational GC. 
> Okay - for batch we somewhat agree I guess - if you can stand any length
> of pausing, non concurrent can be nice, because you don't pay for thread
> sync communication. Only with a small heap size though (less than 100MB
> is what I've seen). You would pause the batch job while GC takes place.
> If you have 8 processors, and you are pausing all of them to collect a
> large heap using only 1 processor, that doesn't make much sense to me.
> The thread communication pain will be far outweighed by using more
> processors to do the collection faster, and not "stop the world" for
> your batch job so long. Stopping your application dead in its tracks,
> and then only using one of the available processors to collect a large
> heap, while the rest sit idle, doesn't make much sense.
> I also don't agree it ever really makes sense not to do generational
> collection. What is your argument here? Generational collection is
> **way** more efficient for short lived objects, which tend to be up to
> 98% of the objects in most applications. The only way I see that making
> sense is if you have almost no short lived objects (which occurs in
> what, .0001% of apps if at all?). The Sun JVM doesn't even offer a non
> generational approach anymore. It's just standard GC practice.
>> I doubt that there are many
>> batch-oriented applications of Solr, though.
>> The rest of the advice is intended to be general and it sounds like we agree
>> about sizing. If the nursery is not big enough, the tenured space will be
>> used for allocations that have a short lifetime and that will increase the
>> length and/or frequency of major collections.
> Yes - I wasn't arguing with every point - I was picking and choosing :)
> After the heap size, the size of the young generation is the most
> important factor.
>> Cache evictions are the interesting part, because they cause a constant rate
>> of tenured space garbage. In most many servers, you can get a big enough
>> nursery that major collections are very rare. That won't happen in Solr
>> because of cache evictions.
>> The IBM JVM is excellent. Their concurrent generational GC policy is
>> "gencon".
> Yeah, I actually know very little about the IBM JVM, so I wasn't really
> commenting. But from the info I gleaned here and on a couple quick web
> searches, I'm not too impressed by it's GC.
>> wunder
>> -Original Message-
>> From: Mark Miller [] 
>> Sent: Friday, September 25, 2009 10:31 AM
>> To:
>> Subject: Re: Solr and Garbage Collection
>> My bad - later, it looks as if your giving general advice, and thats
>> what I took issue with.
>> Any Collector that is not doing generational collection is essentially
>> from the dark ages and shouldn't be used.
>> Any Collector that doesn't have concurrent options, unless possibly your
>> running a tiny app (under 100MB of RAM), or only have a single CPU, is
>> also dark ages, and not fit for a server environement.
>> I havn't kept up with IBM's JVM, but it sounds like they are well behind
>> Sun in GC then.
>> - Mark
>> Walter Underwood wrote:
>>> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
>>> pause" collector is only in the Sun JVM.
>>> I just found this excellent article about the various IBM GC options for a
>>> Lucene application with a 100GB heap:
>>> _h.html
>>> wunder
>>> -Original Message-
>>> From: Mark Miller [] 
>>> Sent: Friday, September 25, 2009 10:03 AM
>>> To:
>>> Subject: Re: Solr and Garbage Collection
>>> Walter Underwood wrote:
>>>> 30ms is not better or worse than 1s until you look at the service
>>>> requirements. For many applications, it is worth dedicating 10% of your
>>>> processing time to GC if that makes the worst-case pause short.
>>>>
>>>> On the other hand, my experience with the IBM JVM was that the maximum query
>>>> rate was 2-3X better with the concurrent generational GC compared to any of
>>>> their other GC algorithms, so we got the best throughput along with th

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Walter Underwood wrote:
> For batch-oriented computing, like Hadoop, the most efficient GC is probably
> a non-concurrent, non-generational GC. 
Okay - for batch we somewhat agree I guess - if you can stand any length
of pausing, non-concurrent can be nice, because you don't pay for thread
sync communication. Only with a small heap size though (less than 100MB
is what I've seen). You would pause the batch job while GC takes place.
If you have 8 processors, and you are pausing all of them to collect a
large heap using only 1 processor, that doesn't make much sense to me.
The thread communication pain will be far outweighed by using more
processors to do the collection faster, and not "stop the world" for
your batch job so long. Stopping your application dead in its tracks,
and then only using one of the available processors to collect a large
heap, while the rest sit idle, doesn't make much sense.

I also don't agree it ever really makes sense not to do generational
collection. What is your argument here? Generational collection is
**way** more efficient for short-lived objects, which tend to be up to
98% of the objects in most applications. The only way I see that making
sense is if you have almost no short-lived objects (which occurs in
what, .0001% of apps if at all?). The Sun JVM doesn't even offer a
non-generational approach anymore. It's just standard GC practice.
> I doubt that there are many
> batch-oriented applications of Solr, though.
> The rest of the advice is intended to be general and it sounds like we agree
> about sizing. If the nursery is not big enough, the tenured space will be
> used for allocations that have a short lifetime and that will increase the
> length and/or frequency of major collections.
Yes - I wasn't arguing with every point - I was picking and choosing :)
After the heap size, the size of the young generation is the most
important factor.
> Cache evictions are the interesting part, because they cause a constant rate
> of tenured space garbage. In most many servers, you can get a big enough
> nursery that major collections are very rare. That won't happen in Solr
> because of cache evictions.
> The IBM JVM is excellent. Their concurrent generational GC policy is
> "gencon".
Yeah, I actually know very little about the IBM JVM, so I wasn't really
commenting. But from the info I gleaned here and on a couple of quick web
searches, I'm not too impressed by its GC.
> wunder
> -Original Message-
> From: Mark Miller [] 
> Sent: Friday, September 25, 2009 10:31 AM
> To:
> Subject: Re: Solr and Garbage Collection
> My bad - later, it looks as if your giving general advice, and thats
> what I took issue with.
> Any Collector that is not doing generational collection is essentially
> from the dark ages and shouldn't be used.
> Any Collector that doesn't have concurrent options, unless possibly your
> running a tiny app (under 100MB of RAM), or only have a single CPU, is
> also dark ages, and not fit for a server environement.
> I havn't kept up with IBM's JVM, but it sounds like they are well behind
> Sun in GC then.
> - Mark
> Walter Underwood wrote:
>> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
>> pause" collector is only in the Sun JVM.
>> I just found this excellent article about the various IBM GC options for a
>> Lucene application with a 100GB heap:
>> _h.html
>> wunder
>> -Original Message-
>> From: Mark Miller [] 
>> Sent: Friday, September 25, 2009 10:03 AM
>> To:
>> Subject: Re: Solr and Garbage Collection
>> Walter Underwood wrote:
>>> 30ms is not better or worse than 1s until you look at the service
>>> requirements. For many applications, it is worth dedicating 10% of your
>>> processing time to GC if that makes the worst-case pause short.
>>> On the other hand, my experience with the IBM JVM was that the maximum
>> query
>>> rate was 2-3X better with the concurrent generational GC compared to any
>> of
>>> their other GC algorithms, so we got the best throughput along with the
>>> shortest pauses.
>> With which collector? Since the very early JVM's, all GC is generational.
>> Most of the collectors (other than the Serial Collector) also work
>> concurrently.
>> By default, they are concurrent on different generations, but you can
>> add concurrency
>> to the "other" generation with each now too.
>>> Solr garbage generation (for queries) seems to have two major components:
>>> per-request garbage and cache evictions. With a generational collector,
>>> these two are handled by separate parts of the collector.
>> Different parts of the 

Solr + Jboss + Custom Transformers

2009-09-25 Thread Papiya Misra


I am trying to use a custom transformer that extends

I have the CustomTransformer.jar and DataImportHandler.jar in
JBOSS/server/default/lib. I have the solr.war (as is from the distro) in
the JBOSS/server/default/deploy.

org.apache.solr.handler.dataimport.EntityProcessorWrapper (line 110)
returns false for the following code
 clazz.newInstance() instanceof Transformer

This happens because the CustomTransformer uses the Transformer from a
different ClassLoader than the Solr web application.

I could use the source code to create a solr.war that includes the
CustomTransformer class. Is there any other option - one that preferably
does not involve re-packaging solr.war?


RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
For batch-oriented computing, like Hadoop, the most efficient GC is probably
a non-concurrent, non-generational GC. I doubt that there are many
batch-oriented applications of Solr, though.

The rest of the advice is intended to be general and it sounds like we agree
about sizing. If the nursery is not big enough, the tenured space will be
used for allocations that have a short lifetime and that will increase the
length and/or frequency of major collections.

Cache evictions are the interesting part, because they cause a constant rate
of tenured space garbage. In many servers, you can get a big enough
nursery that major collections are very rare. That won't happen in Solr
because of cache evictions.

The IBM JVM is excellent. Their concurrent generational GC policy is
"gencon".


-Original Message-
From: Mark Miller [] 
Sent: Friday, September 25, 2009 10:31 AM
Subject: Re: Solr and Garbage Collection

My bad - later, it looks as if you're giving general advice, and that's
what I took issue with.

Any Collector that is not doing generational collection is essentially
from the dark ages and shouldn't be used.

Any Collector that doesn't have concurrent options, unless possibly you're
running a tiny app (under 100MB of RAM), or only have a single CPU, is
also dark ages, and not fit for a server environment.

I haven't kept up with IBM's JVM, but it sounds like they are well behind
Sun in GC then.

- Mark

Walter Underwood wrote:
> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
> pause" collector is only in the Sun JVM.
> I just found this excellent article about the various IBM GC options for a
> Lucene application with a 100GB heap:
> _h.html
> wunder
> -Original Message-
> From: Mark Miller [] 
> Sent: Friday, September 25, 2009 10:03 AM
> To:
> Subject: Re: Solr and Garbage Collection
> Walter Underwood wrote:
>> 30ms is not better or worse than 1s until you look at the service
>> requirements. For many applications, it is worth dedicating 10% of your
>> processing time to GC if that makes the worst-case pause short.
>> On the other hand, my experience with the IBM JVM was that the maximum
> query
>> rate was 2-3X better with the concurrent generational GC compared to any
> of
>> their other GC algorithms, so we got the best throughput along with the
>> shortest pauses.
> With which collector? Since the very early JVMs, all GC is generational.
> Most of the collectors (other than the Serial Collector) also work
> concurrently.
> By default, they are concurrent on different generations, but you can
> add concurrency
> to the "other" generation with each now too.
>> Solr garbage generation (for queries) seems to have two major components:
>> per-request garbage and cache evictions. With a generational collector,
>> these two are handled by separate parts of the collector.
> Different parts of the collector? It's a different collector depending on
> the generation.
> The young generation is collected with a copy collector. This is because
> almost all the objects
> in the young generation are likely dead, and a copy collector only needs
> to visit live objects. So
> it's very efficient. The tenured generation uses something more along the
> lines of mark and sweep or mark
> and compact.
>>  Per-request
>> garbage should completely fit in the short-term heap (nursery), so that it
>> can be collected rapidly and returned to use for further requests. If the
>> nursery is too small, the per-request allocations will be made in tenured
>> space and sit there until the next major GC. Cache evictions are almost
>> always in long-term storage (tenured space) because an LRU algorithm
>> guarantees that the garbage will be old.
>> Check the growth rate of tenured space (under constant load, of course)
>> while increasing the size of the nursery. That rate should drop when the
>> nursery gets big enough, then not drop much further as it is increased
> more.
>> After that, reduce the size of tenured space until major GCs start
> happening
>> "too often" (a judgment call). A bigger tenured space means longer major
> GCs
>> and thus longer pauses, so you don't want it oversized by too much.
> With the concurrent low pause collector, the goal is to avoid "major"
> collections,
> by collecting *before* the tenured space is filled. If you are
> getting "major" collections,
> you need to tune your settings - the whole point of that collector is to
> avoid "major"
> collections, and do almost all of the work while your application is not
> paused. There are
> still 2 brief pauses during the collection, but they should not be
> significant at all.
>> Al

8 for 1.4

2009-09-25 Thread Grant Ingersoll


We're down to 8 open issues:

2 are packaging related, one is dependent on the official 2.9 release  
(so should be taken care of today or tomorrow I suspect) and then we  
have a few others.

The only two somewhat major ones are S-1458, S-1294 (more on this in a  
mo') and S-1449.

On S-1294, the SolrJS patch, I yet again have concerns about even  
including this, given the lack of activity (from Matthias, the  
original author and others) and the fact that some in the Drupal  
community have already forked this to fix the various bugs in it  
instead of just submitting patches.  While I really like the idea of  
this library (jQuery is awesome), I have yet to see interest in the  
community to maintain it (unless you count someone forking it and  
fixing the bugs in the fork as maintenance) and I'll be upfront in  
admitting I have neither the time nor the patience to debug Javascript  
across the gazillions of browsers out there (I don't even have IE on  
my machine unless you count firing up a VM w/ XP on it) in the wild.   
Given what I know of most of the other committers here, I suspect that  
is true for others too.  At a minimum, I think S-1294 should be pushed  
to 1.5.  Next up, I think we consider pulling SolrJS from the release,  
but keeping it in trunk and officially releasing it with either 1.5 or  
1.4.1, assuming it's gotten some love in the meantime.  If by then it  
has no love, I vote we remove it and let the fork maintain it and  
point people there.


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
My bad - later, it looks as if you're giving general advice, and that's
what I took issue with.

Any Collector that is not doing generational collection is essentially
from the dark ages and shouldn't be used.

Any Collector that doesn't have concurrent options, unless possibly you're
running a tiny app (under 100MB of RAM), or only have a single CPU, is
also dark ages, and not fit for a server environment.

I haven't kept up with IBM's JVM, but it sounds like they are well behind
Sun in GC then.

- Mark

Walter Underwood wrote:
> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
> pause" collector is only in the Sun JVM.
> I just found this excellent article about the various IBM GC options for a
> Lucene application with a 100GB heap:
> _h.html
> wunder
> -Original Message-
> From: Mark Miller [] 
> Sent: Friday, September 25, 2009 10:03 AM
> To:
> Subject: Re: Solr and Garbage Collection
> Walter Underwood wrote:
>> 30ms is not better or worse than 1s until you look at the service
>> requirements. For many applications, it is worth dedicating 10% of your
>> processing time to GC if that makes the worst-case pause short.
>> On the other hand, my experience with the IBM JVM was that the maximum
> query
>> rate was 2-3X better with the concurrent generational GC compared to any
> of
>> their other GC algorithms, so we got the best throughput along with the
>> shortest pauses.
> With which collector? Since the very early JVMs, all GC is generational.
> Most of the collectors (other than the Serial Collector) also work
> concurrently.
> By default, they are concurrent on different generations, but you can
> add concurrency
> to the "other" generation with each now too.
>> Solr garbage generation (for queries) seems to have two major components:
>> per-request garbage and cache evictions. With a generational collector,
>> these two are handled by separate parts of the collector.
> Different parts of the collector? It's a different collector depending on
> the generation.
> The young generation is collected with a copy collector. This is because
> almost all the objects
> in the young generation are likely dead, and a copy collector only needs
> to visit live objects. So
> it's very efficient. The tenured generation uses something more along the
> lines of mark and sweep or mark
> and compact.
>>  Per-request
>> garbage should completely fit in the short-term heap (nursery), so that it
>> can be collected rapidly and returned to use for further requests. If the
>> nursery is too small, the per-request allocations will be made in tenured
>> space and sit there until the next major GC. Cache evictions are almost
>> always in long-term storage (tenured space) because an LRU algorithm
>> guarantees that the garbage will be old.
>> Check the growth rate of tenured space (under constant load, of course)
>> while increasing the size of the nursery. That rate should drop when the
>> nursery gets big enough, then not drop much further as it is increased
> more.
>> After that, reduce the size of tenured space until major GCs start
> happening
>> "too often" (a judgment call). A bigger tenured space means longer major
> GCs
>> and thus longer pauses, so you don't want it oversized by too much.
> With the concurrent low pause collector, the goal is to avoid "major"
> collections,
> by collecting *before* the tenured space is filled. If you are
> getting "major" collections,
> you need to tune your settings - the whole point of that collector is to
> avoid "major"
> collections, and do almost all of the work while your application is not
> paused. There are
> still 2 brief pauses during the collection, but they should not be
> significant at all.
>> Also check the hit rates of your caches. If the hit rate is low, say 20%
> or
>> less, make that cache much bigger or set it to zero. Either one will
> reduce
>> the number of cache evictions. If you have an HTTP cache in front of Solr,
>> zero may be the right choice, since the HTTP cache is cherry-picking the
>> easily cacheable requests.
>> Note that a commit nearly doubles the memory required, because you have
> two
>> live Searcher objects with all their caches. Make sure you have headroom
> for
>> a commit.
>> If you want to test the tenured space usage, you must test with real world
>> queries. Those are the only way to get accurate cache eviction rates.
>> wunder
>> -Original Message-
>> From: Jonathan Ariel [] 
>> Sent: Friday, September 25, 2009 9:34 AM
>> To:
>> Subject: Re: Solr and Garbage Collection
>> BTW why making them equal will lower the 

Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Ok. I will try with the "concurrent low pause" collector and let you know
the results.
On Fri, Sep 25, 2009 at 2:23 PM, Walter Underwood wrote:

> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
> pause" collector is only in the Sun JVM.
> I just found this excellent article about the various IBM GC options for a
> Lucene application with a 100GB heap:
> _h.html
> wunder
> -Original Message-
> From: Mark Miller []
> Sent: Friday, September 25, 2009 10:03 AM
> To:
> Subject: Re: Solr and Garbage Collection
> Walter Underwood wrote:
> > 30ms is not better or worse than 1s until you look at the service
> > requirements. For many applications, it is worth dedicating 10% of your
> > processing time to GC if that makes the worst-case pause short.
> >
> > On the other hand, my experience with the IBM JVM was that the maximum
> query
> > rate was 2-3X better with the concurrent generational GC compared to any
> of
> > their other GC algorithms, so we got the best throughput along with the
> > shortest pauses.
> >
> With which collector? Since the very early JVMs, all GC is generational.
> Most of the collectors (other than the Serial Collector) also work
> concurrently.
> By default, they are concurrent on different generations, but you can
> add concurrency
> to the "other" generation with each now too.
> > Solr garbage generation (for queries) seems to have two major components:
> > per-request garbage and cache evictions. With a generational collector,
> > these two are handled by separate parts of the collector.
> Different parts of the collector? It's a different collector depending on
> the generation.
> The young generation is collected with a copy collector. This is because
> almost all the objects
> in the young generation are likely dead, and a copy collector only needs
> to visit live objects. So
> it's very efficient. The tenured generation uses something more along the
> lines of mark and sweep or mark
> and compact.
> >  Per-request
> > garbage should completely fit in the short-term heap (nursery), so that
> it
> > can be collected rapidly and returned to use for further requests. If the
> > nursery is too small, the per-request allocations will be made in tenured
> > space and sit there until the next major GC. Cache evictions are almost
> > always in long-term storage (tenured space) because an LRU algorithm
> > guarantees that the garbage will be old.
> >
> > Check the growth rate of tenured space (under constant load, of course)
> > while increasing the size of the nursery. That rate should drop when the
> > nursery gets big enough, then not drop much further as it is increased
> more.
> >
> > After that, reduce the size of tenured space until major GCs start
> happening
> > "too often" (a judgment call). A bigger tenured space means longer major
> GCs
> > and thus longer pauses, so you don't want it oversized by too much.
> >
> With the concurrent low pause collector, the goal is to avoid "major"
> collections,
> by collecting *before* the tenured space is filled. If you are
> getting "major" collections,
> you need to tune your settings - the whole point of that collector is to
> avoid "major"
> collections, and do almost all of the work while your application is not
> paused. There are
> still 2 brief pauses during the collection, but they should not be
> significant at all.
> > Also check the hit rates of your caches. If the hit rate is low, say 20%
> or
> > less, make that cache much bigger or set it to zero. Either one will
> reduce
> > the number of cache evictions. If you have an HTTP cache in front of
> Solr,
> > zero may be the right choice, since the HTTP cache is cherry-picking the
> > easily cacheable requests.
> >
> > Note that a commit nearly doubles the memory required, because you have
> two
> > live Searcher objects with all their caches. Make sure you have headroom
> for
> > a commit.
> >
> > If you want to test the tenured space usage, you must test with real
> world
> > queries. Those are the only way to get accurate cache eviction rates.
> >
> > wunder
> >
> >
> > BTW why making them equal will lower the frequency of GC?
> >
> >
> >>> Bigger heaps lead to bigger GC pauses in general.
> >>>
> >> Opposite viewpoint:
> >> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
> >>
> > once-per-second.
> >
> >> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
> >>
> >> Use -server option.
> >>
> >> -server option of JVM is 'native CPU code', I remember WebLogic 7
> console
> >> with SUN JVM 1.3 not showing any GC (just horizontal line).

RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
pause" collector is only in the Sun JVM.

I just found this excellent article about the various IBM GC options for a
Lucene application with a 100GB heap:


Walter Underwood wrote:
> 30ms is not better or worse than 1s until you look at the service
> requirements. For many applications, it is worth dedicating 10% of your
> processing time to GC if that makes the worst-case pause short.
> On the other hand, my experience with the IBM JVM was that the maximum query
> rate was 2-3X better with the concurrent generational GC compared to any of
> their other GC algorithms, so we got the best throughput along with the
> shortest pauses.
With which collector? Since the very early JVMs, all GC is generational.
Most of the collectors (other than the Serial Collector) also work
concurrently.
By default, they are concurrent on different generations, but you can
add concurrency
to the "other" generation with each now too.
> Solr garbage generation (for queries) seems to have two major components:
> per-request garbage and cache evictions. With a generational collector,
> these two are handled by separate parts of the collector.
Different parts of the collector? It's a different collector depending on
the generation.
The young generation is collected with a copy collector. This is because
almost all the objects
in the young generation are likely dead, and a copy collector only needs
to visit live objects. So
it's very efficient. The tenured generation uses something more along the
lines of mark and sweep or mark
and compact.
>  Per-request
> garbage should completely fit in the short-term heap (nursery), so that it
> can be collected rapidly and returned to use for further requests. If the
> nursery is too small, the per-request allocations will be made in tenured
> space and sit there until the next major GC. Cache evictions are almost
> always in long-term storage (tenured space) because an LRU algorithm
> guarantees that the garbage will be old.
> Check the growth rate of tenured space (under constant load, of course)
> while increasing the size of the nursery. That rate should drop when the
> nursery gets big enough, then not drop much further as it is increased more.
> After that, reduce the size of tenured space until major GCs start happening
> "too often" (a judgment call). A bigger tenured space means longer major GCs
> and thus longer pauses, so you don't want it oversized by too much.
With the concurrent low pause collector, the goal is to avoid "major"
collections,
by collecting *before* the tenured space is filled. If you are
getting "major" collections,
you need to tune your settings - the whole point of that collector is to
avoid "major"
collections, and do almost all of the work while your application is not
paused. There are
still 2 brief pauses during the collection, but they should not be
significant at all.
> Also check the hit rates of your caches. If the hit rate is low, say 20% or
> less, make that cache much bigger or set it to zero. Either one will reduce
> the number of cache evictions. If you have an HTTP cache in front of Solr,
> zero may be the right choice, since the HTTP cache is cherry-picking the
> easily cacheable requests.
> Note that a commit nearly doubles the memory required, because you have two
> live Searcher objects with all their caches. Make sure you have headroom for
> a commit.
> If you want to test the tenured space usage, you must test with real world
> queries. Those are the only way to get accurate cache eviction rates.
> wunder
> BTW why making them equal will lower the frequency of GC?
> On 9/25/09, Fuad Efendi  wrote:
>>> Bigger heaps lead to bigger GC pauses in general.
>> Opposite viewpoint:
>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
> once-per-second.
>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>> Use -server option.
>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>> with SUN JVM 1.3 not showing any GC (just horizontal line).
- Mark

boost function for date as unix stamp

2009-09-25 Thread Joe Calderon
hello *, i read on the wiki about using recip(rord(...)...) to boost
recent documents with a date field, does anyone have a good function
for doing something similar with unix timestamps?

if not, is there a lot of overhead related to counting the number of
distinct values for rord() ?

thx much
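
A possible starting point, sketched in Python rather than as a Solr function query: the wiki documents recip(x,m,a,b) as a/(m*x+b), and the same reciprocal shape can be applied to a document's age in seconds instead of its rord() rank. The helper name recency_boost, the one-year scale, and the sample timestamp are assumptions made for illustration, not anything from the wiki.

```python
import time

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumed scale; pick what suits your data


def recip(x, m, a, b):
    """Same shape as Solr's recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(doc_ts, now_ts=None):
    """Hypothetical helper: boost in (0, 1] that decays with document age."""
    if now_ts is None:
        now_ts = time.time()
    age = max(0.0, now_ts - doc_ts)  # clamp future-dated documents to age 0
    # With m = 1/SECONDS_PER_YEAR and a = b = 1, a one-year-old doc gets ~0.5.
    return recip(age, 1.0 / SECONDS_PER_YEAR, 1.0, 1.0)


now = 1253836800  # 2009-09-25 00:00:00 UTC
print(recency_boost(now, now))                     # brand-new document
print(recency_boost(now - SECONDS_PER_YEAR, now))  # one year old
```

Because this works on the raw timestamp, it does not depend on how many distinct values the field holds; whether that is cheaper than rord() in a real index would need measuring.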


Re: download pre-release nightly solr 1.4

2009-09-25 Thread Mark Miller
michael8 wrote:
> markrmiller wrote:
>> michael8 wrote:
>>> Hi,
>>> I know Solr 1.4 is going to be released any day now pending Lucene 2.9
>>> release.  Is there anywhere where one can download a pre-released nightly
>>> build of Solr 1.4 just for getting familiar with new features (e.g. field
>>> collapsing)?
>>> Thanks,
>>> Michael
>> You can download nightlies
>> here:
>> field collapsing won't be in 1.4 though. You have to build from svn
>> after applying the patch for that.
>> -- 
>> - Mark
- Mark

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Walter Underwood wrote:
> 30ms is not better or worse than 1s until you look at the service
> requirements. For many applications, it is worth dedicating 10% of your
> processing time to GC if that makes the worst-case pause short.
> On the other hand, my experience with the IBM JVM was that the maximum query
> rate was 2-3X better with the concurrent generational GC compared to any of
> their other GC algorithms, so we got the best throughput along with the
> shortest pauses.
With which collector? Since the very early JVMs, all GC is generational.
Most of the collectors (other than the Serial Collector) also work
concurrently.
By default, they are concurrent on different generations, but you can
add concurrency
to the "other" generation with each now too.
> Solr garbage generation (for queries) seems to have two major components:
> per-request garbage and cache evictions. With a generational collector,
> these two are handled by separate parts of the collector.
Different parts of the collector? It's a different collector depending on
the generation.
The young generation is collected with a copy collector. This is because
almost all the objects
in the young generation are likely dead, and a copy collector only needs
to visit live objects. So
it's very efficient. The tenured generation uses something more along the
lines of mark and sweep or mark
and compact.
>  Per-request
> garbage should completely fit in the short-term heap (nursery), so that it
> can be collected rapidly and returned to use for further requests. If the
> nursery is too small, the per-request allocations will be made in tenured
> space and sit there until the next major GC. Cache evictions are almost
> always in long-term storage (tenured space) because an LRU algorithm
> guarantees that the garbage will be old.
> Check the growth rate of tenured space (under constant load, of course)
> while increasing the size of the nursery. That rate should drop when the
> nursery gets big enough, then not drop much further as it is increased more.
> After that, reduce the size of tenured space until major GCs start happening
> "too often" (a judgment call). A bigger tenured space means longer major GCs
> and thus longer pauses, so you don't want it oversized by too much.
With the concurrent low pause collector, the goal is to avoid "major"
collections,
by collecting *before* the tenured space is filled. If you are
getting "major" collections,
you need to tune your settings - the whole point of that collector is to
avoid "major"
collections, and do almost all of the work while your application is not
paused. There are
still 2 brief pauses during the collection, but they should not be
significant at all.
> Also check the hit rates of your caches. If the hit rate is low, say 20% or
> less, make that cache much bigger or set it to zero. Either one will reduce
- Mark

Re: download pre-release nightly solr 1.4

2009-09-25 Thread michael8

markrmiller wrote:
> michael8 wrote:
>> Hi,
>> I know Solr 1.4 is going to be released any day now pending Lucene 2.9
>> release.  Is there anywhere where one can download a pre-released nightly
>> build of Solr 1.4 just for getting familiar with new features (e.g. field
>> collapsing)?
>> Thanks,
>> Michael
> You can download nightlies
> here:
> field collapsing won't be in 1.4 though. You have to build from svn
> after applying the patch for that.
> -- 
> - Mark

Thanks for the info Mark.  If field collapsing is a patch, can I apply the
patch against 1.3 then?  Thanks again.



RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
30ms is not better or worse than 1s until you look at the service
requirements. For many applications, it is worth dedicating 10% of your
processing time to GC if that makes the worst-case pause short.

On the other hand, my experience with the IBM JVM was that the maximum query
rate was 2-3X better with the concurrent generational GC compared to any of
their other GC algorithms, so we got the best throughput along with the
shortest pauses.

Solr garbage generation (for queries) seems to have two major components:
per-request garbage and cache evictions. With a generational collector,
these two are handled by separate parts of the collector. Per-request
garbage should completely fit in the short-term heap (nursery), so that it
can be collected rapidly and returned to use for further requests. If the
nursery is too small, the per-request allocations will be made in tenured
space and sit there until the next major GC. Cache evictions are almost
always in long-term storage (tenured space) because an LRU algorithm
guarantees that the garbage will be old.
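
A toy illustration of that last point, in Python and not Solr's actual cache code: a small LRU cache that records how many operations each entry survived before it was evicted. The class name and the tick counter are made up for the sketch; the only claim is that an LRU policy never evicts an entry younger than the cache's capacity, which is why evicted entries tend to have been promoted out of the nursery already.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Minimal LRU cache that tracks the age of every evicted entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (value, tick when inserted)
        self.tick = 0                 # counts cache operations
        self.evicted_ages = []

    def get(self, key):
        self.tick += 1
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key][0]
        return None

    def put(self, key, value):
        self.tick += 1
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = (value, self.tick)
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry and record how old it was.
            _, (_, born) = self.entries.popitem(last=False)
            self.evicted_ages.append(self.tick - born)


cache = ToyLRUCache(capacity=3)
for i in range(6):
    cache.put(i, str(i))
print(cache.evicted_ages)  # ages, in operations, of the evicted entries
```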

Check the growth rate of tenured space (under constant load, of course)
while increasing the size of the nursery. That rate should drop when the
nursery gets big enough, then not drop much further as it is increased more.

After that, reduce the size of tenured space until major GCs start happening
"too often" (a judgment call). A bigger tenured space means longer major GCs
and thus longer pauses, so you don't want it oversized by too much.

Also check the hit rates of your caches. If the hit rate is low, say 20% or
less, make that cache much bigger or set it to zero. Either one will reduce
the number of cache evictions. If you have an HTTP cache in front of Solr,
zero may be the right choice, since the HTTP cache is cherry-picking the
easily cacheable requests.
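
The rule of thumb above can be written down as a small check. This is only a sketch in Python: the function name and the returned strings are invented, and it assumes you already have cumulative hit and lookup counters for each cache from your own monitoring.

```python
def cache_advice(hits, lookups, low_hit_rate=0.20):
    """Apply the 20% rule of thumb to one cache's counters.

    hits and lookups are cumulative counters; low_hit_rate defaults to
    the "say 20% or less" figure from the text.
    """
    if lookups == 0:
        return "no traffic yet"
    hit_rate = hits / lookups
    if hit_rate <= low_hit_rate:
        return "make it much bigger or set its size to zero"
    return "leave as is"


print(cache_advice(hits=150, lookups=1000))  # 15% hit rate
print(cache_advice(hits=800, lookups=1000))  # 80% hit rate
```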

Note that a commit nearly doubles the memory required, because you have two
live Searcher objects with all their caches. Make sure you have headroom for
a commit.

If you want to test the tenured space usage, you must test with real world
queries. Those are the only way to get accurate cache eviction rates.


-Original Message-
From: Jonathan Ariel [] 
Sent: Friday, September 25, 2009 9:34 AM
Subject: Re: Solr and Garbage Collection

BTW why making them equal will lower the frequency of GC?

On 9/25/09, Fuad Efendi  wrote:
>> Bigger heaps lead to bigger GC pauses in general.
> Opposite viewpoint:
> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
> Use -server option.
> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
> with SUN JVM 1.3 not showing any GC (just horizontal line).
> -Fuad

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
>-server option of JVM is 'native CPU code', I remember WebLogic 7 console
>with SUN JVM 1.3 not showing any GC (just horizontal line).

Not sure what that is all about either. -server and -client are just two
different versions of HotSpot.
The -server version is optimized for long-running applications - it
starts slower, and over time, it learns
about your app and makes good throughput optimizations.

The -client HotSpot version starts up faster and concentrates
more on responsiveness than throughput.
Better for desktop apps. -server is better for long-lived server apps.

Mark Miller wrote:
> It won't really - it will just keep the JVM from wasting time resizing
> the heap on you. Since you know you need so much RAM anyway, no reason
> not to just pin it at what you need.
> Not going to help you much with GC though.
> Jonathan Ariel wrote:
>> BTW why making them equal will lower the frequency of GC?
>> On 9/25/09, Fuad Efendi  wrote:
>>>> Bigger heaps lead to bigger GC pauses in general.
>>> Opposite viewpoint:
>>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.
>>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>>> Use -server option.
>>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>>> with SUN JVM 1.3 not showing any GC (just horizontal line).
>>> -Fuad

- Mark

Re: Faceted Search on Dynamic Fields?

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 12:19 PM, Avlesh Singh  wrote:
> Faceting, as of now, can only be done on definitive field names.

To further clarify, the fields you can facet on can include those
defined by dynamic fields.  You just must specify the exact field name
when you facet.


Did you really mean for the ampersand to be in the dynamic field name?
 I'd advise against this, and it could be the source of your problems
(escaping the ampersand in your request, etc).
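
To illustrate the escaping issue, here is a small Python sketch that builds the query string for a facet request. The field name is hypothetical (the original request is not shown in this thread); facet and facet.field are the standard faceting parameters. An unescaped ampersand splits the field name into two parameters, while URL-encoding it as %26 keeps it intact.

```python
from urllib.parse import parse_qs, urlencode

field = "price_R&D_s"  # hypothetical dynamic field name containing '&'

# Naive string concatenation: the '&' is read as a parameter separator.
naive_query = "facet=true&facet.field=" + field
print(parse_qs(naive_query)["facet.field"])

# urlencode() escapes the '&' as %26, so the field name survives.
safe_query = urlencode([("q", "*:*"), ("facet", "true"), ("facet.field", field)])
print(safe_query)
print(parse_qs(safe_query)["facet.field"])
```

Even with correct escaping, a field name without the ampersand avoids the problem entirely.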

What is the exact facet request you are sending?


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
It won't really - it will just keep the JVM from wasting time resizing
the heap on you. Since you know you need so much RAM anyway, no reason
not to just pin it at what you need.
Not going to help you much with GC though.

Jonathan Ariel wrote:
> BTW why making them equal will lower the frequency of GC?
> On 9/25/09, Fuad Efendi  wrote:
>>> Bigger heaps lead to bigger GC pauses in general.
>> Opposite viewpoint:
>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.
>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>> Use -server option.
>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>> with SUN JVM 1.3 not showing any GC (just horizontal line).
- Mark

Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
BTW why making them equal will lower the frequency of GC?

On 9/25/09, Fuad Efendi  wrote:
>> Bigger heaps lead to bigger GC pauses in general.
> Opposite viewpoint:
> 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.
> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
> Use -server option.
> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
> with SUN JVM 1.3 not showing any GC (just horizontal line).

Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I can't really understand how increasing the heap will decrease the
11% dedicated to GC

On 9/25/09, Fuad Efendi  wrote:
>> You are saying that I should give more memory than 12GB?
> Yes. Look at this:
>> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> It can't find few (!!!) contiguous bytes for .createValue(...)
> It can't add (Field Value, Document ID) pair to an array.
> GC tuning won't help in this specific case...
> Maybe SOLR/Lucene core developers could WARM FieldCache at IndexReader
> opening time, in the future... to have early OOM...
> Avoiding faceting (and sorting) on such field will only postpone OOM to
> unpredictable date/time...
2009-09-25 thread cbennett

You are saying that I should give more memory than 12GB?
When I was with 10GB I had the exceptions that I sent. Switching to 12GB
made them disappear.
So I think I don't have problems with FieldCache any more. What seems
like a problem is the 11% of application time dedicated to GC, especially
when those servers are under really heavy load.
I think that's why I sometimes get queries that at one moment are
executed in a few ms and a moment later take 20 seconds!

It seems like I should tune my jvm, don't you think so?

On Fri, Sep 25, 2009 at 1:01 PM, Fuad Efendi  wrote:

> Give it even more memory.
> Lucene FieldCache is used to store non-tokenized single-value non-boolean
> (DocumentId -> FieldValue) pairs, and it is used (in-full!) for instance
> for
> sorting query results.
> So that if you have 100,000,000 documents with specific heavily
> field values (cardinality is high! Size is 100bytes!) you need
> 10,000,000,000 bytes for just this instance of FieldCache.
> GC does not play any role. FieldCache won't be GC-collected.
> -Fuad
> > -Original Message-
> > From: Jonathan Ariel []
> > Sent: September-25-09 11:37 AM
> > To:;
> > Subject: Re: Solr and Garbage Collection
> >
> > Right, now I'm giving it 12GB of heap memory.
> > If I give it less (10GB) it throws the following exception:
> >
> > Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> > at $Cache.get(
> > at org.apache.solr.request.SimpleFacets.getFacetCounts(
> > at org.apache.solr.core.SolrCore.execute(
> > at org.mortbay.jetty.servlet.ServletHandler.handle(
> > at org.mortbay.jetty.servlet.SessionHandler.handle(
> > at org.mortbay.jetty.handler.ContextHandler.handle(
> > at org.mortbay.jetty.webapp.WebAppContext.handle(
> >
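
The FieldCache estimate quoted above (100,000,000 documents with 100-byte values) is plain multiplication, reproduced here as a Python sketch. The function name is invented, and the result is only a rough lower bound: it assumes one distinct value per document and ignores per-object overhead.

```python
def fieldcache_bytes(num_docs, avg_value_bytes):
    """Rough lower bound: one value per document, no object overhead."""
    return num_docs * avg_value_bytes


estimate = fieldcache_bytes(100_000_000, 100)
print(estimate)             # total bytes for this one field
print(estimate / 1024**3)   # the same figure in GiB
```

That comes to 10,000,000,000 bytes, a little over 9 GiB, for a single field, which is consistent with a 10GB heap running out and a 12GB heap surviving.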