Re: Hierarchical Facet Field Prefix Not Working

2009-09-25 Thread Koji Sekiguchi

Hi Nasseam,

I think the per-field parameter for facet.prefix should work
on hierarchical facet fields, judging from a brief look at the patch.
I get the same facet results from:

&facet=on&facet.field=hiefacet&facet.prefix=A/B/

and

&facet=on&facet.field=hiefacet&f.hiefacet.facet.prefix=A/B/

when using the sample data in the SOLR-64 thread.
Perhaps I'm missing something.
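For reference, the override behavior I'd expect can be sketched in a few lines (hypothetical Python, not Solr's actual code — the point is just that a per-field `f.<field>.facet.prefix`, when present, should win over the global `facet.prefix`):

```python
# Hypothetical sketch of per-field facet parameter resolution:
# f.<field>.facet.prefix overrides the global facet.prefix for that field.
def resolve_facet_prefix(params, field):
    return params.get("f.%s.facet.prefix" % field, params.get("facet.prefix"))

global_only = {"facet.field": "hiefacet", "facet.prefix": "A/B/"}
per_field = {"facet.field": "hiefacet", "f.hiefacet.facet.prefix": "A/B/"}

# Both request styles yield the same effective prefix for the field:
assert resolve_facet_prefix(global_only, "hiefacet") == "A/B/"
assert resolve_facet_prefix(per_field, "hiefacet") == "A/B/"
```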

Koji

Nasseam Elkarra wrote:

Hello all,

We are using the patch from SOLR-64 
(http://issues.apache.org/jira/browse/SOLR-64) to implement 
hierarchical facets for categories. We are trying to use the 
facet.prefix to prevent all categories from coming back. However, 
f.category.facet.prefix doesn't work. Using facet.prefix works but 
prevents the other facets from coming back since it is a global 
option. Are per facet options supported on hierarchical facet fields? 
If not, how can I get a specific category and its children without 
getting the surrounding categories?


Any help is much appreciated.

Thank you,

Nasseam Elkarra
http://bodukai.com/boutique/
The fastest possible shopping experience.






Re: Mixed field types and boolean searching

2009-09-25 Thread Lance Norskog
The DisMax parser essentially creates a set of queries against
different fields. These queries are analyzed as per each field.

I think this is what you are talking about: "The" in a movie title is
different from "the" in the movie description. Would you expect "The
Sound Of Music" to fetch every movie in the database? So "the" is a
stopword in the description but not in the title.

Also, the DisMax parser has no OR. It has +, -, and "at least one of,
and more is better". The query "a b" means "a or b, but both is
better". "+a +b" means "a AND b". "+a b" means "must have 'a' but is
better with 'b'".
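Those clause semantics can be sketched like this (a toy classifier, not the actual DisMax parser):

```python
def classify_clauses(query):
    """Rough sketch of DisMax clause semantics: '+' = must match,
    '-' = must not match, bare terms = 'at least one of, and more
    is better' (optional, but each match improves the score)."""
    out = {"must": [], "prohibited": [], "should": []}
    for tok in query.split():
        if tok.startswith("+"):
            out["must"].append(tok[1:])
        elif tok.startswith("-"):
            out["prohibited"].append(tok[1:])
        else:
            out["should"].append(tok)
    return out

# "+a b": must have 'a', better with 'b'
assert classify_clauses("+a b") == {"must": ["a"], "prohibited": [], "should": ["b"]}
```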

On Fri, Sep 25, 2009 at 7:04 AM, Ensdorf Ken  wrote:
>> No- there are various analyzers. StandardAnalyzer is geared toward
>> searching bodies of text for interesting words -  punctuation is
>> ripped out. Other analyzers are more useful for "concrete" text. You
>> may have to work at finding one that leaves punctuation in.
>>
>
> My problem is not with the StandardAnalyzer per se, but more as to how 
> "dismax" style queries are handled by the query parser when the different 
> fields have different sets of ignored tokens or stop words.
>
> Say you want to use the contents of a text box in your app and query a field 
> in Solr.  The user enters "A and B", so you map this to "f1:A and f1:B".  
> Now, if "B" is an ignored token in the "f1" field for whatever reason, the 
> query boils down to "f1:A".
>
> Now imagine you want to allow the user's text to match multiple fields - as 
> in any term can match any field, but all terms must match at least 1 field.  
> So now you map the user's query to "(f1:A OR f2:A) AND (f1:B OR f2:B)".  But 
> if f2 does not ignore "B", the query boils down to "(f1:A OR f2:A) AND 
> (f2:B)".  Now documents that could come back when you were only matching 
> against the f1 field don't come back.
>
> This seems counter-intuitive - to be consistent, I would think the query 
> should essentially be treated as "(f1:A OR f2:A) AND (TRUE OR f2:B) " - and 
> thus a term that is a stop word or ignored token for any of the fields would 
> be ignored across the board.
>
> So I guess what I'm asking is if there is a reason for the existing behavior, 
> or is it just a fact-of-life of the query parser?  Thanks!
>
> -Ken
>
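The asymmetry Ken describes can be simulated in a few lines (hypothetical analyzers and stop lists, not Solr's real query-building code): a term that is a stop word for f1 but not for f2 still produces a mandatory clause against f2, so documents that only matched via f1 stop coming back.

```python
# Assumption for illustration: f1 ignores "b", f2 keeps every term.
F1_STOPWORDS = {"b"}
F2_STOPWORDS = set()

def build_clauses(terms):
    """Build the '(f1:t OR f2:t) AND ...' query, dropping a term from a
    field's disjunction when that field's analyzer ignores it."""
    clauses = []
    for t in terms:
        fields = []
        if t not in F1_STOPWORDS:
            fields.append("f1:%s" % t)
        if t not in F2_STOPWORDS:
            fields.append("f2:%s" % t)
        if fields:  # term survived in at least one field
            clauses.append("(" + " OR ".join(fields) + ")")
    return " AND ".join(clauses)

# "b" vanished from f1's disjunction but remains a required match on f2:
assert build_clauses(["a", "b"]) == "(f1:a OR f2:a) AND (f2:b)"
```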



-- 
Lance Norskog
goks...@gmail.com


Problem changing the default MergePolicy/Scheduler

2009-09-25 Thread Jibo John

Hello,

It looks like Solr is not allowing me to change the default
MergePolicy/Scheduler classes.

Even if I change the default MergePolicy/Scheduler
(LogByteSizeMergePolicy and ConcurrentMergeScheduler) defined
in solrconfig.xml to a different one (LogDocMergePolicy and
SerialMergeScheduler), my profiler shows the default classes are still
being loaded.


Also, if I use the default LogByteSizeMergePolicy, I can't seem to
override 'calibrateSizeByDeletes' to 'true' in solrconfig.xml using
the new syntax introduced this week (SOLR-1447).
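For reference, the configuration I'm attempting looks something like the following — element and parameter names here are my reading of SOLR-1447, so treat them as an assumption rather than verified syntax:

```xml
<!-- Sketch of the solrconfig.xml syntax; exact element/param names are
     my interpretation of SOLR-1447, not verified against the patch. -->
<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
  <bool name="calibrateSizeByDeletes">true</bool>
</mergePolicy>
<mergeScheduler class="org.apache.lucene.index.SerialMergeScheduler"/>
```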


I'm using the version checked out from trunk yesterday.

Any pointers will be helpful.

Thanks,
-Jibo


Re: Showcase: Facetted Search for Wine using Solr

2009-09-25 Thread Lance Norskog
Have you seen this? It is another Solr/TYPO3 integration project.

http://forge.typo3.org/projects/show/extension-solr

Would you consider open-sourcing your Solr/TYPO3 integration?

On Fri, Sep 25, 2009 at 1:18 AM, Marian Steinbach
 wrote:
> Hi Grant!
>
> Thanks for the advice, I added the link to the list.
>
> Regards,
>
> Marian
>
>
> On Fri, Sep 25, 2009 at 5:14 AM, Grant Ingersoll  wrote:
>> Hi Marian,
>>
>> Looks great!  Wish I could order some wine.  When you get a chance, please
>> add the site to http://wiki.apache.org/solr/PublicServers!
>>
>> Cheers,
>> Grant
>>
>> On Sep 24, 2009, at 11:51 AM, marian.steinbach wrote:
>>
>>> Hello everybody!
>>>
>>> The purpose of this mail is to say "thank you" to the creators of Solr
>>> and to the community that supports it.
>>>
>>> We released our first project using Solr several weeks ago, after
>>> having tested Solr for several months.
>>>
>>> The project I'm talking about is a product search for an online wine
>>> shop (sorry, german user interface only):
>>>
>>>  http://www.koelner-weinkeller.de/index.php?id=sortiment
>>>
>>> Our client offers about 3000 different wines and other related products.
>>>
>>> Before we introduced Solr, the products were searched via
>>> complicated and slow SQL statements, with all kinds of problems related
>>> to that. No full-text indexing, no stemming, etc.
>>>
>>> We are happy to make use of several built-in features which solve
>>> problems that bugged us: faceted search, German accents and stemming,
>>> and synonyms being the most important ones.
>>>
>>> The surrounding website is TYPO3 driven. We integrated Solr by
>>> creating our own frontend plugin which talks to the Solr webservice
>>> (and we're very happy about the PHP output type!).
>>>
>>> I'd be glad about your comments.
>>>
>>> Cheers,
>>>
>>> Marian
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr http post performance seems slow - help?

2009-09-25 Thread Lance Norskog
Your indexing project is disk-bound. My modern midrange laptop gets
30MB/s doing "cat file > /dev/null" (one 7200rpm disk). The Amazon instances
I'm playing with get 50-60 MB/s (I really want to know how it fits
together). Your laptop might be 10-20 MB/s.

On Thu, Sep 24, 2009 at 11:54 PM, Constantijn Visinescu
 wrote:
> This may or may not help but here goes :)
>
> When I was running performance tests I took a look at the simple post tool
> that comes with the Solr examples.
>
> First I changed my schema.xml to fit my needs and then I deleted the old
> index so Solr created a blank one when I started up.
> Then I had a process chew on my data and spit out XML files that are
> formatted similarly to the XML files that the SimplePostTool example uses.
> Next I used the simple post tool to post the XML files to Solr (60k-80k
> records per XML file). Each file only took a couple of minutes to index this
> way.
> Commit and optimize after that (took less than 10 minutes), and after about
> 2.5 hrs I had indexed just under 8 million records.
>
> This was on a 4 year old single core laptop using resin 3 as my servlet
> container.
>
> Hope this helps.
>
>
> On Fri, Sep 25, 2009 at 3:51 AM, Lance Norskog  wrote:
>
>> In "top", press the '1' key. This will give a list of the CPUs and how
>> much load is on each. The display is otherwise a little weird for
>> multi-CPU machines. But don't be surprised when Solr is I/O bound. The
>> biggest, fanciest RAID is often a better investment than CPUs. On one
>> project we bought low-end rack servers that come with 6-8 disk bays,
>> filling them with 10k/15k RPM disks.
>>
>> On Wed, Sep 23, 2009 at 2:47 PM, Dan A. Dickey 
>> wrote:
>> > On Friday 11 September 2009 11:06:20 am Dan A. Dickey wrote:
>> > ...
>> >> Our JBoss expert and I will be looking into why this might be occurring.
>> >> Does anyone know of any JBoss related slowness with Solr?
>> >> And does anyone have any other sort of suggestions to speed indexing
>> >> performance?   Thanks for your help all!  I'll keep you up to date with
>> >> further progress.
>> >
>> > Ok, further progress... just to keep any interested parties up to date
>> > and for the record...
>> >
>> > I'm finding that using the "example" jetty setup (will be switching very
>> > very soon to a "real" jetty installation) is about the fastest.  Using
>> > several processes to send posts to Solr helps a lot, and we're seeing
>> > about 80 posts a second this way.
>> >
>> > We also stripped down JBoss to the bare bones and the Solr in it
>> > is running nearly as fast - about 50 posts a second.  It was our previous
>> > JBoss configuration that was making it appear "slow" for some reason.
>> >
>> > We will be running more tests and spreading out the "pre-index" workload
>> > across more machines and more processes. In our case we were seeing
>> > the bottleneck being one machine running 18 processes.
>> > The 2 quad core xeon system is experiencing about a 25% cpu load.
>> > And I'm not certain, but I think this may be actually 25% of one of
>> > the 8 cores.
>> > So, there's *lots* of room for Solr to be doing more work there.
>> >        -Dan
>> >
>> > --
>> > Dan A. Dickey | Senior Software Engineer
>> >
>> > Savvis
>> > 10900 Hampshire Ave. S., Bloomington, MN  55438
>> > Office: 952.852.4803 | Fax: 952.852.4951
>> > E-mail: dan.dic...@savvis.net
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
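The "25% CPU" figure Dan mentions is ambiguous, and the two readings differ a lot — a quick sanity check (plain arithmetic, no claims about his actual setup):

```python
cores = 8

# Reading 1: "25%" means 25% of total machine capacity,
# i.e. two full cores' worth of work.
assert 0.25 * cores == 2.0

# Reading 2: "25%" means 25% of a single core, which is
# only 25/8 = 3.125% of the whole machine.
assert 25 / cores == 3.125
```

Either way, the machine has plenty of idle capacity left for Solr.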



-- 
Lance Norskog
goks...@gmail.com


Re: problem with HTMLStripStandardTokenizerFactory

2009-09-25 Thread Yonik Seeley
Can you give a small test file that demonstrates the problem?

-Yonik
http://www.lucidimagination.com



On Fri, Sep 25, 2009 at 5:34 AM, Kundig, Andreas
 wrote:
> Hello
>
> I can't get HTMLStripStandardTokenizerFactory to remove the content of the 
> style tag, as the documentation says it should.
>
> A search for 'mso' returns a document where the search term only appears in 
> the style tag (it's a Word document saved as HTML). Here is the highlight 
> returned by Solr (by the way: the wrong word is highlighted).
>
> "vetica;\n\tpanose-1:2 11 5 4 2 2 2 2 2 4;\n\tmso-font-charset:0;\n\tmso-generic-font-family:swiss;"
>
> I am using solr 1.3. Here is how I configured the tokenizer in schema.xml
>
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>         generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
>         splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
>         protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>         ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>         words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>         generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
>         splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
>         protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> Am I doing something wrong?
>
> thank you
> Andréas Kündig
>
> World Intellectual Property Organization Disclaimer:
>
> This electronic message may contain privileged, confidential and
> copyright protected information. If you have received this e-mail
> by mistake, please immediately notify the sender and delete this
> e-mail and all its attachments. Please ensure all e-mail attachments
> are scanned for viruses prior to opening or using.
>


RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Sorry for the off-topic:
Create a dummy "Hello, World!" JSP, use Tomcat, execute load-stress
simulator(s) from separate machine(s), and measure... don't forget to
allocate the necessary thread pools in Tomcat (if you have to)...
Although such a JSP doesn't use any memory, you will see how easily one can
reach 5000 TPS (or 'virtually' 5 concurrent users) on modern quad-cores
by simply allocating more memory (...GB) and more Tomcat threads. There is
a threshold too... repeat it with HTTPD workers (and threads), same result,
although it doesn't use any GC. More memory - more threads - more
keep-alives per TCP...

However, 'theoretically' you need only 64Mb for "Hello World" :)))





Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I have around 8M documents.
I set up my server to use a different collector and it seems like it
decreased from 11% to 4%; of course I need to wait a bit more because it is
just a 1-hour-old log. But it seems like it is much better now.
I will tell you the results on Monday :)
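The percentage is just the stopped time as a fraction of total wall-clock time — roughly like this (the interval values below are made up for illustration, not from my actual logs):

```python
# Sketch: estimating the fraction of wall-clock time spent in GC pauses,
# given the intervals reported by -XX:+PrintGCApplicationStoppedTime and
# -XX:+PrintGCApplicationConcurrentTime (values here are invented).
stopped = [0.4, 1.2, 0.3]     # seconds the application was stopped for GC
running = [10.0, 25.0, 8.0]   # seconds the application ran between pauses

overhead = sum(stopped) / (sum(stopped) + sum(running))
assert round(overhead * 100, 1) == 4.2   # ~4% of time spent paused
```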

On Fri, Sep 25, 2009 at 6:07 PM, Mark Miller  wrote:

> Thats a good point too - if you can reduce your need for such a large
> heap, by all means, do so.
>
> However, considering you already need at least 10GB or you get OOM, you
> have a long way to go with that approach. Good luck :)
>
> How many docs do you have ? I'm guessing its mostly FieldCache type
> stuff, and thats the type of thing you can't really side step, unless
> you give up the functionality thats using it.
>
> Grant Ingersoll wrote:
> >
> > On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
> >
> >> Hi to all!
> >> Lately my solr servers seem to stop responding once in a while. I'm
> >> using
> >> solr 1.3.
> >> Of course I'm having more traffic on the servers.
> >> So I logged the Garbage Collection activity to check if it's because of
> >> that. It seems like 11% of the time the application runs, it is stopped
> >> because of GC. And some times the GC takes up to 10 seconds!
> >> Is is normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
> >> servers. My index is around 10GB and I'm giving to the instances 10GB of
> >> RAM.
> >>
> >> How can I check which is the GC that it is being used? If I'm right JVM
> >> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do
> >> you have
> >> any recommendation on this?
> >
> >
> > As I said in Eteve's thread on JVM settings, some extra time spent on
> > application design/debugging will save a whole lot of headache in
> > Garbage Collection and trying to tune the gazillion different options
> > available.  Ask yourself:  What is on the heap and does it need to be
> > there?  For instance, do you, if you have them, really need sortable
> > ints?   If your servers seem to come to a stop, I'm going to bet you
> > have major collections going on.  Major collections in a production
> > system are very bad.  They tend to happen right after commits in
> > poorly tuned systems, but can also happen in other places if you let
> > things build up due to really large heaps and/or things like really
> > large cache settings.  I would pull up jConsole and have a look at
> > what is happening when the pauses occur.  Is it a major collection?
> > If so, then hook up a heap analyzer or a profiler and see what is on
> > the heap around those times.  Then have a look at your schema/config,
> > etc. and see if there are things that are memory intensive (sorting,
> > faceting, excessively large filter caches).
> >
> > --
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> > using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
One more point and I'll stop - I've hit my email quota for the day ;)

While it's a pain to have to juggle GC params and tune - when you require
a heap that's more than a gig or two, I personally believe it's essential
to do so for good performance. The default settings / ergonomics with
throughput just don't cut it. Sad fact of life :) Luckily, you don't
generally have to do that much to get things nice - the number of
options is not that staggering, and you don't usually need to get into
most of them. Choosing the right collector, and tweaking a setting or
two, can often be enough.

The most important thing to do with a large heap and the throughput collector
is to turn on parallel tenured collection. I've said it before, but it
really is key. At least if you have more than a processor or two -
which, for your sake, I hope you do :)
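On Sun's HotSpot JVMs of this era, the flags look something like the following — heap size and start.jar are placeholders for your own setup, so double-check against your JVM version:

```shell
# Throughput (parallel) collector for both the young and the tenured
# generation; -XX:+UseParallelOldGC is the "parallel tenured" switch.
java -server -Xmx10g \
     -XX:+UseParallelGC \
     -XX:+UseParallelOldGC \
     -jar start.jar
```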

- Mark

Mark Miller wrote:
> Thats a good point too - if you can reduce your need for such a large
> heap, by all means, do so.
>
> However, considering you already need at least 10GB or you get OOM, you
> have a long way to go with that approach. Good luck :)
>
> How many docs do you have ? I'm guessing its mostly FieldCache type
> stuff, and thats the type of thing you can't really side step, unless
> you give up the functionality thats using it.
>
> Grant Ingersoll wrote:
>   
>> On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
>>
>> 
>>> Hi to all!
>>> Lately my solr servers seem to stop responding once in a while. I'm
>>> using
>>> solr 1.3.
>>> Of course I'm having more traffic on the servers.
>>> So I logged the Garbage Collection activity to check if it's because of
>>> that. It seems like 11% of the time the application runs, it is stopped
>>> because of GC. And some times the GC takes up to 10 seconds!
>>> Is is normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
>>> servers. My index is around 10GB and I'm giving to the instances 10GB of
>>> RAM.
>>>
>>> How can I check which is the GC that it is being used? If I'm right JVM
>>> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do
>>> you have
>>> any recommendation on this?
>>>   
>> As I said in Eteve's thread on JVM settings, some extra time spent on
>> application design/debugging will save a whole lot of headache in
>> Garbage Collection and trying to tune the gazillion different options
>> available.  Ask yourself:  What is on the heap and does it need to be
>> there?  For instance, do you, if you have them, really need sortable
>> ints?   If your servers seem to come to a stop, I'm going to bet you
>> have major collections going on.  Major collections in a production
>> system are very bad.  They tend to happen right after commits in
>> poorly tuned systems, but can also happen in other places if you let
>> things build up due to really large heaps and/or things like really
>> large cache settings.  I would pull up jConsole and have a look at
>> what is happening when the pauses occur.  Is it a major collection? 
>> If so, then hook up a heap analyzer or a profiler and see what is on
>> the heap around those times.  Then have a look at your schema/config,
>> etc. and see if there are things that are memory intensive (sorting,
>> faceting, excessively large filter caches).
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>> 
>
>
>   


-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Thats a good point too - if you can reduce your need for such a large
heap, by all means, do so.

However, considering you already need at least 10GB or you get OOM, you
have a long way to go with that approach. Good luck :)

How many docs do you have ? I'm guessing its mostly FieldCache type
stuff, and thats the type of thing you can't really side step, unless
you give up the functionality thats using it.

Grant Ingersoll wrote:
>
> On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:
>
>> Hi to all!
>> Lately my solr servers seem to stop responding once in a while. I'm
>> using
>> solr 1.3.
>> Of course I'm having more traffic on the servers.
>> So I logged the Garbage Collection activity to check if it's because of
>> that. It seems like 11% of the time the application runs, it is stopped
>> because of GC. And some times the GC takes up to 10 seconds!
>> Is is normal? My instances run on a 16GB RAM, Dual Quad Core Intel Xeon
>> servers. My index is around 10GB and I'm giving to the instances 10GB of
>> RAM.
>>
>> How can I check which is the GC that it is being used? If I'm right JVM
>> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do
>> you have
>> any recommendation on this?
>
>
> As I said in Eteve's thread on JVM settings, some extra time spent on
> application design/debugging will save a whole lot of headache in
> Garbage Collection and trying to tune the gazillion different options
> available.  Ask yourself:  What is on the heap and does it need to be
> there?  For instance, do you, if you have them, really need sortable
> ints?   If your servers seem to come to a stop, I'm going to bet you
> have major collections going on.  Major collections in a production
> system are very bad.  They tend to happen right after commits in
> poorly tuned systems, but can also happen in other places if you let
> things build up due to really large heaps and/or things like really
> large cache settings.  I would pull up jConsole and have a look at
> what is happening when the pauses occur.  Is it a major collection? 
> If so, then hook up a heap analyzer or a profiler and see what is on
> the heap around those times.  Then have a look at your schema/config,
> etc. and see if there are things that are memory intensive (sorting,
> faceting, excessively large filter caches).
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>


-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Mark Miller wrote:
> Jonathan Ariel wrote:
>   
>> How can I check which is the GC that it is being used? If I'm right JVM
>> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
>> any recommendation on this?
>>
>>   
>> 
> Just to straighten out this one too - Ergonomics doesn't use throughput
> - throughput is the collector that allows Ergonomics ;)
>
> And throughput is the default as long as your machine is detected as
> server class.
>
> But throughput is not great with large tenured spaces out of the box. It
> only parallelizes the new space collection. You have to turn on an
> option to get parallel tenured collection as well - which is essential
> to scale to large heap sizes.
>
>   
hmm - I'm not being totally accurate there - ergonomics is what detects
server and so makes throughput the default collector for a server
machine. But much of the GC ergonomics support only works with the
throughput collector. Kind of chicken and egg :)

-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Jonathan Ariel wrote:
> How can I check which is the GC that it is being used? If I'm right JVM
> Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
> any recommendation on this?
>
>   
Just to straighten out this one too - Ergonomics doesn't use throughput
- throughput is the collector that allows Ergonomics ;)

And throughput is the default as long as your machine is detected as
server class.

But throughput is not great with large tenured spaces out of the box. It
only parallelizes the new space collection. You have to turn on an
option to get parallel tenured collection as well - which is essential
to scale to large heap sizes.

-- 
- Mark

http://www.lucidimagination.com





RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
> Usually, fragmentation is dealt with using a mark-compact collector (or
> IBM has used a mark-sweep-compact collector).
> Copying collectors are not only super efficient at collecting young
> spaces, but they are also great for fragmentation - when you copy
> everything to the new space, you can remove any fragmentation. At the
> cost of double the space requirements though.


So if the memory size is optimized (application-specific!) no object
copy will ever happen, although it is server-loading specific too
(application-usage-specific; what do they do most frequently?)
- just statistics; you need to monitor the JVM and make a decision.

A few years ago I had a hard time explaining to a client that a byte array
should be Base64 encoded instead of just 123... instead of GC tuning...

SOLR uses XML; try to upload a big XML - each Element instance needs at least
100 bytes... try to create an array of 20M Elements (the parser will do!)... so
any GC tuning is application-usage specific too... RAM allocation and
GC tuning is "usage"-specific, not SOLR-specific...
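The arithmetic behind that point is worth spelling out — assuming ~100 bytes per DOM Element (a rough per-object estimate, not a measured figure):

```python
# Rough heap cost of parsing a huge XML document into a DOM tree:
# ~100 bytes per Element (assumption) times 20 million Elements.
elements = 20_000_000
bytes_per_element = 100

total_bytes = elements * bytes_per_element
assert total_bytes == 2_000_000_000  # ~2 GB of heap just for the tree
```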




Re: Solr and Garbage Collection

2009-09-25 Thread Grant Ingersoll


On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote:


Hi to all!
Lately my solr servers seem to stop responding once in a while. I'm using
solr 1.3.
Of course I'm having more traffic on the servers.
So I logged the Garbage Collection activity to check if it's because of
that. It seems like 11% of the time the application runs, it is stopped
because of GC. And some times the GC takes up to 10 seconds!
Is it normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon
servers. My index is around 10GB and I'm giving the instances 10GB of
RAM.

How can I check which GC is being used? If I'm right, JVM
Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
any recommendation on this?



As I said in Eteve's thread on JVM settings, some extra time spent on  
application design/debugging will save a whole lot of headache in  
Garbage Collection and trying to tune the gazillion different options  
available.  Ask yourself:  What is on the heap and does it need to be  
there?  For instance, do you, if you have them, really need sortable  
ints?   If your servers seem to come to a stop, I'm going to bet you  
have major collections going on.  Major collections in a production  
system are very bad.  They tend to happen right after commits in  
poorly tuned systems, but can also happen in other places if you let  
things build up due to really large heaps and/or things like really  
large cache settings.  I would pull up jConsole and have a look at  
what is happening when the pauses occur.  Is it a major collection?   
If so, then hook up a heap analyzer or a profiler and see what is on  
the heap around those times.  Then have a look at your schema/config,  
etc. and see if there are things that are memory intensive (sorting,  
faceting, excessively large filter caches).


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
>> or IBM has used a mark-sweep-compact collector

Never mind - Sun's is also sometimes referred to as mark-sweep-compact.
I've just seen it referred to as mark-compact before as well. In either
case though, without some sort of sweep phase, there is no reclamation
of memory :)

It's interesting though - in the days of the early JVMs Sun talked more
about compaction - but if you look at their recent info, they don't even
mention it, or give you params to mess with it. They just talk about
the mark and the sweep phase.

IBM is much more open about a compaction phase, and not only do they
give controls to tune it, they let you turn it off completely.

Not sure what Sun is doing with compaction these days - or if they just
work with fragmentation avoidance techniques instead - haven't seen any
info on it.
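The mark-compact idea itself is simple enough to sketch in a toy model (this has nothing to do with any real JVM's implementation — just the concept): mark what's reachable, then slide the live objects to the bottom of the heap, which removes fragmentation.

```python
# Toy mark-compact: None marks a garbage slot in a flat "heap".
heap = ["a", None, "b", None, None, "c"]

live = [obj for obj in heap if obj is not None]        # mark phase
compacted = live + [None] * (len(heap) - len(live))    # compact/slide phase

# Live objects now sit contiguously at the bottom; free space is one
# contiguous region at the top - no fragmentation.
assert compacted == ["a", "b", "c", None, None, None]
```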


Mark Miller wrote:
> When we talk about Collectors, we are not just talking about
> "collecting" - whatever that means. There isn't really a "collecting"
> phase - the whole algorithm is garbage collecting - hence calling the
> different implementations "collectors".
>
> Usually, fragmentation is dealt with using a mark-compact collector (or
> IBM has used a mark-sweep-compact collector).
> Copying collectors are not only super efficient at collecting young
> spaces, but they are also great for fragmentation - when you copy
> everything to the new space, you can remove any fragmentation. At the
> cost of double the space requirements though.
>
> So mark-compact is a compromise. First you mark whats reachable, then
> everything thats marked is copied/compacted to the bottom of the heap.
> Its all part of a "collection" though.
>
> Jonathan Ariel wrote:
>   
>> Maybe what's missing here is how I got the 11%. I just ran Solr with the
>> following JVM params: -XX:+PrintGCApplicationConcurrentTime and
>> -XX:+PrintGCApplicationStoppedTime; with those I can measure the amount of
>> time the application runs between collection pauses and the length of the
>> collection pauses, respectively.
>> I think that in this case the 11% is just for memory collection and not
>> defragmentation... but I'm not 100% sure.
>>
>> On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:
>>
>>   
>> 
>>> But again, GC is not just "Garbage Collection" as many in this thread
>>> think... it is also "memory defragmentation" which is much costly than
>>> "collection" just because it needs move somewhere _live_objects_ (and
>>> wait/lock till such objects get unlocked to be moved...) - obviously more
>>> memory helps...
>>>
>>> 11% is extremely high.
>>>
>>>
>>> -Fuad
>>> http://www.linkedin.com/in/liferay
>>>
>>>
>>> 
>>>   
 -Original Message-
 From: Jonathan Ariel [mailto:ionat...@gmail.com]
 Sent: September-25-09 3:36 PM
 To: solr-user@lucene.apache.org
 Subject: Re: FW: Solr and Garbage Collection

 I'm not planning on lowering the heap. I just want to lower the time
 "wasted" on GC, which is 11% right now. So what I'll try is changing the GC
 to -XX:+UseConcMarkSweepGC

 On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
> Mark,
>
> what if piece of code needs 10 contiguous Kb to load a document field? How
> locked memory pieces are optimized/moved (putting on hold almost whole
> application)?
> Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize
> of live objects!!!) even if RAM is (theoretically) enough.
>
> -Fuad
>
>> Fuad, you didn't read the thread right.
>>
>> He is not having a problem with OOM. He got the OOM because he lowered
>> the heap to try and help GC.
>>
>> He normally runs with a heap that can handle his FC.
>>
>> Please re-read the thread. You are confusing the thread.
>>
>> - Mark
>
>>> GC will frequently happen even if RAM is more than enough: in case if it
>>> is heavily sparse... so that have even more RAM!
>>> -Fuad


-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Ok. I'll first change the GC and see if the time spent decreases. Then
I'll try increasing the heap as Fuad recommends.

On 9/25/09, Mark Miller  wrote:
> When we talk about Collectors, we are not just talking about
> "collecting" - whatever that means. There isn't really a "collecting"
> phase - the whole algorithm is garbage collecting - hence calling the
> different implementations "collectors".
>
> Usually, fragmentation is dealt with using a mark-compact collector (or
> IBM has used a mark-sweep-compact collector).
> Copying collectors are not only super efficient at collecting young
> spaces, but they are also great for fragmentation - when you copy
> everything to the new space, you can remove any fragmentation. At the
> cost of double the space requirements though.
>
> So mark-compact is a compromise. First you mark what's reachable, then
> everything that's marked is copied/compacted to the bottom of the heap.
> It's all part of a "collection" though.
>
> Jonathan Ariel wrote:
>> Maybe what's missing here is how did I get the 11%. I just ran solr with
>> the
>> following JVM params: -XX:+PrintGCApplicationConcurrentTime
>> -XX:+PrintGCApplicationStoppedTime with that I can measure the amount of
>> time the application run between collection pauses and the length of the
>> collection pauses, respectively.
>> I think that in this case the 11% is just for memory collection and not
>> defragmentation... but I'm not 100% sure.
>>
>> On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:
>>
>>
>>> But again, GC is not just "Garbage Collection" as many in this thread
>>> think... it is also "memory defragmentation" which is much more costly than
>>> "collection" just because it needs to move _live_objects_ somewhere (and
>>> wait/lock till such objects get unlocked to be moved...) - obviously more
>>> memory helps...
>>>
>>> 11% is extremely high.
>>>
>>>
>>> -Fuad
>>> http://www.linkedin.com/in/liferay
>>>
>>>
>>>
>>>> -Original Message-
>>>> From: Jonathan Ariel [mailto:ionat...@gmail.com]
>>>> Sent: September-25-09 3:36 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: FW: Solr and Garbage Collection
>>>>
>>>> I'm not planning on lowering the heap. I just want to lower the time
>>>> "wasted" on GC, which is 11% right now. So what I'll try is changing the GC
>>>> to -XX:+UseConcMarkSweepGC
>>>>
>>>> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> what if piece of code needs 10 contiguous Kb to load a document field? How
>>>>> locked memory pieces are optimized/moved (putting on hold almost whole
>>>>> application)?
>>>>> Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize of
>>>>> live objects!!!) even if RAM is (theoretically) enough.
>>>>>
>>>>> -Fuad
>>>>>
>>>>>> Fuad, you didn't read the thread right.
>>>>>>
>>>>>> He is not having a problem with OOM. He got the OOM because he lowered
>>>>>> the heap to try and help GC.
>>>>>>
>>>>>> He normally runs with a heap that can handle his FC.
>>>>>>
>>>>>> Please re-read the thread. You are confusing the thread.
>>>>>>
>>>>>> - Mark
>>>>>
>>>>>>> GC will frequently happen even if RAM is more than enough: in case if it is
>>>>>>> heavily sparse... so that have even more RAM!
>>>>>>> -Fuad
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


solr home

2009-09-25 Thread Park, Michael
I already have a handful of solr instances running.  However, I'm
trying to install solr (1.4) on a new linux server with tomcat using a
context file (same way I usually do):

(context file XML stripped by the list archive)

However it throws an exception due to the following:

SEVERE: Could not start SOLR. Check solr/home property

java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in
classpath or 'solr/conf/', cwd=/opt/local/solr/fedora_solr

at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:198)

at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:166)

 

Any ideas why this is happening?

 

Thanks, Mike
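[For reference, the inline context file above was stripped by the list archive. A minimal sketch of the kind of Tomcat context file being described follows; docBase and the solr/home value are placeholders, not Mike's actual values. The key point is that the solr/home directory must contain conf/solrconfig.xml, which is exactly what the exception complains about.]

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Placeholder paths: adjust docBase and the solr/home value to your layout.
     Solr resolves 'solrconfig.xml' under <solr/home>/conf/ at startup. -->
<Context docBase="/opt/local/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/local/solr/fedora_solr" override="true"/>
</Context>
```

Dropped into e.g. $CATALINA_HOME/conf/Catalina/localhost/solr.xml, this is the usual per-webapp deployment; if solr/home resolves to a directory without conf/solrconfig.xml, the RuntimeException above is the result.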



Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
When we talk about Collectors, we are not just talking about
"collecting" - whatever that means. There isn't really a "collecting"
phase - the whole algorithm is garbage collecting - hence calling the
different implementations "collectors".

Usually, fragmentation is dealt with using a mark-compact collector (or
IBM has used a mark-sweep-compact collector).
Copying collectors are not only super efficient at collecting young
spaces, but they are also great for fragmentation - when you copy
everything to the new space, you can remove any fragmentation. At the
cost of double the space requirements though.

So mark-compact is a compromise. First you mark what's reachable, then
everything that's marked is copied/compacted to the bottom of the heap.
It's all part of a "collection" though.
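[The compaction step described above can be sketched as a toy model; this is not JVM internals, just the idea, with strings standing in for heap objects:]

```python
def mark_compact(heap, roots):
    """Toy mark-compact cycle: 'mark' the reachable objects, then slide
    them to the bottom of the heap so the free space is contiguous."""
    live = [obj for obj in heap if obj in roots]   # mark phase (trivialized)
    free = len(heap) - len(live)
    return live + [None] * free                    # compact: live objects at the bottom

print(mark_compact(["a", "b", "c", "d", "e"], {"a", "c", "e"}))
# ['a', 'c', 'e', None, None]
```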

Jonathan Ariel wrote:
> Maybe what's missing here is how did I get the 11%. I just ran solr with the
> following JVM params: -XX:+PrintGCApplicationConcurrentTime
> -XX:+PrintGCApplicationStoppedTime with that I can measure the amount of
> time the application runs between collection pauses and the length of the
> collection pauses, respectively.
> I think that in this case the 11% is just for memory collection and not
> defragmentation... but I'm not 100% sure.
>
> On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:
>
>   
>> But again, GC is not just "Garbage Collection" as many in this thread
>> think... it is also "memory defragmentation" which is much more costly than
>> "collection" just because it needs to move _live_objects_ somewhere (and
>> wait/lock till such objects get unlocked to be moved...) - obviously more
>> memory helps...
>>
>> 11% is extremely high.
>>
>>
>> -Fuad
>> http://www.linkedin.com/in/liferay
>>
>>
>> 
>>> -Original Message-
>>> From: Jonathan Ariel [mailto:ionat...@gmail.com]
>>> Sent: September-25-09 3:36 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: FW: Solr and Garbage Collection
>>>
>>> I'm not planning on lowering the heap. I just want to lower the time
>>> "wasted" on GC, which is 11% right now. So what I'll try is changing the GC
>>> to -XX:+UseConcMarkSweepGC
>>>
>>> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
>>>
>>>> Mark,
>>>>
>>>> what if piece of code needs 10 contiguous Kb to load a document field? How
>>>> locked memory pieces are optimized/moved (putting on hold almost whole
>>>> application)?
>>>> Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize of
>>>> live objects!!!) even if RAM is (theoretically) enough.
>>>>
>>>> -Fuad
>>>>
>>>>> Fuad, you didn't read the thread right.
>>>>>
>>>>> He is not having a problem with OOM. He got the OOM because he lowered
>>>>> the heap to try and help GC.
>>>>>
>>>>> He normally runs with a heap that can handle his FC.
>>>>>
>>>>> Please re-read the thread. You are confusing the thread.
>>>>>
>>>>> - Mark
>>>>
>>>>>> GC will frequently happen even if RAM is more than enough: in case if it is
>>>>>> heavily sparse... so that have even more RAM!
>>>>>> -Fuad


-- 
- Mark

http://www.lucidimagination.com





Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-25 Thread Silent Surfer
Hi Michael,

We are storing all of our data in addition to indexing it, as we need to display
those values to the user. So unfortunately we cannot go with the stored=false
option, which could otherwise have solved our issue.

Appreciate any other pointers/suggestions

Thanks,
sS

--- On Fri, 9/25/09, Michael  wrote:

> From: Michael 
> Subject: Re: Can we point a Solr server to index directory dynamically at  
> runtime..
> To: solr-user@lucene.apache.org
> Date: Friday, September 25, 2009, 2:00 PM
> Are you storing (in addition to
> indexing) your data?  Perhaps you could turn
> off storage on data older than 7 days (requires
> reindexing), thus losing the
> ability to return snippets but cutting down on your storage
> space and server
> count.  I've experienced 10x decrease in space
> requirements and a large
> boost in speed after cutting extraneous storage from Solr
> -- the stored data
> is mixed in with the index data and so it slows down
> searches.
> You could also put all 200G onto one Solr instance rather
> than 10 for >7days
> data, and accept that those searches will be slower.
> 
> Michael
> 
> On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer 
> wrote:
> 
> > Hi,
> >
> > Thank you Michael and Chris for the response.
> >
> > Today after the mail from Michael, we tested with the
> dynamic loading of
> > cores and it worked well. So we need to go with the
> hybrid approach of
> > Multicore and Distributed searching.
> >
> > As per our testing, we found that a Solr instance with
> 20 GB of
> > index(single index or spread across multiple cores)
> can provide better
> > performance when compared to having a Solr instance
> say 40 (or) 50 GB of
> > index (single index or index spread across cores).
> >
> > So the 200 GB of index on day 1 will be spread across
> 200/20=10 Solr salve
> > instances.
> >
> > On day 2 data, 10 more Solr slave servers are
> required; Cumulative Solr
> > Slave instances = 200*2/20=20
> > ...
> > ..
> > ..
> > On day 30 data, 10 more Solr slave servers are
> required; Cumulative Solr
> > Slave instances = 200*30/20=300
> >
> > So with the above approach, we may need ~300 Solr
> slave instances, which
> > becomes very unmanageable.
> >
> > But we know that most of the queries is for the past 1
> week, i.e we
> > definitely need 70 Solr Slaves containing the last 7
> days worth of data up
> > and running.
> >
> > Now for the rest of the 230 Solr instances, do we need
> to keep it running
> > for the odd query,that can span across the 30 days of
> data (30*200 GB=6 TB
> > data) which can come up only a couple of times a day.
> > This linear increase of Solr servers with the
> retention period doesn't
> > seems to be a very scalable solution.
> >
> > So we are looking for something more simpler approach
> to handle this
> > scenario.
> >
> > Appreciate any further inputs/suggestions.
> >
> > Regards,
> > sS
> >
> > --- On Fri, 9/25/09, Chris Hostetter 
> wrote:
> >
> > > From: Chris Hostetter 
> > > Subject: Re: Can we point a Solr server to index
> directory dynamically
> > at  runtime..
> > > To: solr-user@lucene.apache.org
> > > Date: Friday, September 25, 2009, 4:04 AM
> > > : Using a multicore approach, you
> > > could send a "create a core named
> > > : 'core3weeksold' pointing to
> '/datadirs/3weeksold' "
> > > command to a live Solr,
> > > : which would spin it up on the fly.  Then
> you query
> > > it, and maybe keep it
> > > : spun up until it's not queried for 60 seconds
> or
> > > something, then send a
> > > : "remove core 'core3weeksold' " command.
> > > : See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler
> > > .
> > >
> > > something that seems implicit in the question is
> what to do
> > > when the
> > > request spans all of the data ... this is where
> (in theory)
> > > distributed
> > > searching could help you out.
> > >
> > > index each days worth of data into it's own core,
> that
> > > makes it really
> > > easy to expire the old data (just UNLOAD and
> delete an
> > > entire core once
> > > it's more then 30 days old) if your user is only
> searching
> > > "current" dta
> > > then your app can directly query the core
> containing the
> > > most current data
> > > -- but if they want to query the last week, or
> last two
> > > weeks worth of
> > > data, you do a distributed request for all of the
> shards
> > > needed to search
> > > the appropriate amount of data.
> > >
> > > Between the ALIAS and SWAP commands it on the
> CoreAdmin
> > > screen it should
> > > be pretty easy have cores with names like
> > > "today","1dayold","2dayold" so
> > > that your app can configure simple shard params
> for all the
> > > perumations
> > > you'll need to query.
> > >
> > >
> > > -Hoss
> > >
> > >
> >
> >
> >
> >
> >
> >
> 
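[With Hoss's daily-core naming scheme, building the shards parameter for a "last N days" query is mechanical. A sketch follows; the host, port and core names are illustrative assumptions, not values from this thread:]

```python
BASE = "localhost:8983/solr"  # assumed host/port; core names follow Hoss's suggestion

def shards_param(days_back):
    """Build the &shards= value covering the 'today' core plus the
    last `days_back` daily cores."""
    cores = ["today"] + ["%ddayold" % d for d in range(1, days_back + 1)]
    return ",".join("%s/%s" % (BASE, core) for core in cores)

print(shards_param(2))
# localhost:8983/solr/today,localhost:8983/solr/1dayold,localhost:8983/solr/2dayold
```

Expiring old data is then just an UNLOAD of the oldest core, and the app picks a `shards_param(N)` matching the user's requested window.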






Re: FW: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Maybe what's missing here is how did I get the 11%. I just ran solr with the
following JVM params: -XX:+PrintGCApplicationConcurrentTime and
-XX:+PrintGCApplicationStoppedTime. With those I can measure the amount of
time the application runs between collection pauses and the length of the
collection pauses, respectively.
I think that in this case the 11% is just for memory collection and not
defragmentation... but I'm not 100% sure.
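[The 11% figure can be recomputed from the output of those two flags. A rough sketch follows; the log line formats are the Sun JVM's, and the sample numbers are invented:]

```python
import re

STOPPED = re.compile(r"stopped: ([0-9.]+)")          # stop-the-world pause length
RUNNING = re.compile(r"Application time: ([0-9.]+)")  # time running between pauses

def gc_overhead(log_lines):
    """Percent of wall time spent in stop-the-world pauses, from the output of
    -XX:+PrintGCApplicationStoppedTime / -XX:+PrintGCApplicationConcurrentTime."""
    stopped = sum(float(m.group(1)) for m in map(STOPPED.search, log_lines) if m)
    running = sum(float(m.group(1)) for m in map(RUNNING.search, log_lines) if m)
    total = stopped + running
    return 100.0 * stopped / total if total else 0.0

# Invented sample in the Sun JVM's log format:
sample = [
    "Application time: 0.8000000 seconds",
    "Total time for which application threads were stopped: 0.1000000 seconds",
    "Application time: 0.9000000 seconds",
    "Total time for which application threads were stopped: 0.2000000 seconds",
]
print(round(gc_overhead(sample), 2))  # 15.0
```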

On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi  wrote:

> But again, GC is not just "Garbage Collection" as many in this thread
> think... it is also "memory defragmentation" which is much more costly than
> "collection" just because it needs to move _live_objects_ somewhere (and
> wait/lock till such objects get unlocked to be moved...) - obviously more
> memory helps...
>
> 11% is extremely high.
>
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
> > -Original Message-
> > From: Jonathan Ariel [mailto:ionat...@gmail.com]
> > Sent: September-25-09 3:36 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: FW: Solr and Garbage Collection
> >
> > I'm not planning on lowering the heap. I just want to lower the time
> > "wasted" on GC, which is 11% right now. So what I'll try is changing the GC
> > to -XX:+UseConcMarkSweepGC
> >
> > On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
> >
> > > Mark,
> > >
> > > what if piece of code needs 10 contiguous Kb to load a document field? How
> > > locked memory pieces are optimized/moved (putting on hold almost whole
> > > application)?
> > > Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize of
> > > live objects!!!) even if RAM is (theoretically) enough.
> > >
> > > -Fuad
> > >
> > >
> > > > Fuad, you didn't read the thread right.
> > > >
> > > > He is not having a problem with OOM. He got the OOM because he lowered
> > > > the heap to try and help GC.
> > > >
> > > > He normally runs with a heap that can handle his FC.
> > > >
> > > > Please re-read the thread. You are confusing the thread.
> > > >
> > > > - Mark
> > > >
> > >
> > >
> > > >> GC will frequently happen even if RAM is more than enough: in case if it is
> > > >> heavily sparse... so that have even more RAM!
> > > >> -Fuad
> > >
> > >
> > >
>
>
>


Re: FW: Solr and Garbage Collection

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 2:52 PM, Fuad Efendi  wrote:
> Lowering heap helps GC?

Yes.  In general, lowering the heap can help or hurt.

Hurt: if one is running very low on memory, GC will be working harder
all of the time trying to find more memory and the % of time that GC
takes can go up.

Help: if one has massive heaps, full GCs may not happen as frequently,
but when they do they can be larger and cause more of a problem.  For
many apps, a .2 second pause every minute is preferable to a 10 second
pause every hour.

And of course the other reason to lower the heap size *if* you don't
need it that big is to leave more memory for other stuff, and for the
OS itself to cache the index files.

-Yonik
http://www.lucidimagination.com


RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
But again, GC is not just "Garbage Collection" as many in this thread
think... it is also "memory defragmentation" which is much more costly than
"collection" just because it needs to move _live_objects_ somewhere (and
wait/lock till such objects get unlocked to be moved...) - obviously more
memory helps...

11% is extremely high.

 
-Fuad
http://www.linkedin.com/in/liferay


> -Original Message-
> From: Jonathan Ariel [mailto:ionat...@gmail.com]
> Sent: September-25-09 3:36 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FW: Solr and Garbage Collection
> 
> I'm not planning on lowering the heap. I just want to lower the time
> "wasted" on GC, which is 11% right now. So what I'll try is changing the GC
> to -XX:+UseConcMarkSweepGC
> 
> On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:
> 
> > Mark,
> >
> > what if piece of code needs 10 contiguous Kb to load a document field? How
> > locked memory pieces are optimized/moved (putting on hold almost whole
> > application)?
> > Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize of
> > live objects!!!) even if RAM is (theoretically) enough.
> >
> > -Fuad
> >
> >
>Fuad, you didn't read the thread right.
> > >
> > > He is not having a problem with OOM. He got the OOM because he lowered
> > > the heap to try and help GC.
> > >
> > > He normally runs with a heap that can handle his FC.
> > >
> Please re-read the thread. You are confusing the thread.
> > >
> > > - Mark
> > >
> >
> >
> > >> GC will frequently happen even if RAM is more than enough: in case if it is
> > >> heavily sparse... so that have even more RAM!
> > >> -Fuad
> >
> >
> >




Re: shards and facet_count

2009-09-25 Thread Paul Rosen
Sorry for the long delay in responding, but I've just gotten back to 
this problem...


I got the solr 1.4 nightly and the problem went away, so I guess it is a 
solr 1.3 bug.


Thanks for all the input!

Lance Norskog wrote:

Paul, can you create an HTTP url that does this exact query? With
multiple shards and facet requests?  And that does what you expect?
That would help the Ruby Dudes to figure out the discrepancy.

Lance

On Fri, Sep 18, 2009 at 7:01 PM, Yonik Seeley
 wrote:

On Fri, Sep 18, 2009 at 5:58 AM, Erik Hatcher  wrote:

It is strange that you get facet=false calls in there, but maybe this is
just normal distributed search protocol in one of the phases?

Right, on the second phase of a distrib request, additional faceting
may not be needed.

But it looks like the distributed request is being directed at two
different handlers rather than two different servers or cores?
shards=localhost:8983/solr/resources,localhost:8983/solr/exhibits

I've never tried this, but from the log file, it doesn't look like the
sub-requests are going to those different handlers since the path is
always path=/select

-Yonik
http://www.lucidimagination.com









Re: FW: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I'm not planning on lowering the heap. I just want to lower the time
"wasted" on GC, which is 11% right now. So what I'll try is changing the GC
to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi  wrote:

> Mark,
>
> what if piece of code needs 10 contiguous Kb to load a document field? How
> locked memory pieces are optimized/moved (putting on hold almost whole
> application)?
> Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize
> of
> live objects!!!) even if RAM is (theoretically) enough.
>
> -Fuad
>
>
> >Fuad, you didn't read the thread right.
> >
> > He is not having a problem with OOM. He got the OOM because he lowered
> > the heap to try and help GC.
> >
> > He normally runs with a heap that can handle his FC.
> >
> > Please re-read the thread. You are confusing the thread.
> >
> > - Mark
> >
>
>
> >> GC will frequently happen even if RAM is more than enough: in case if it is
> >> heavily sparse... so that have even more RAM!
> >> -Fuad
>
>
>


Hierarchical Facet Field Prefix Not Working

2009-09-25 Thread Nasseam Elkarra

Hello all,

We are using the patch from SOLR-64 (http://issues.apache.org/jira/browse/SOLR-64 
) to implement hierarchical facets for categories. We are trying to  
use the facet.prefix to prevent all categories from coming back.  
However, f.category.facet.prefix doesn't work. Using facet.prefix  
works but prevents the other facets from coming back since it is a  
global option. Are per facet options supported on hierarchical facet  
fields? If not, how can I get a specific category and its children  
without getting the surrounding categories?


Any help is much appreciated.

Thank you,

Nasseam Elkarra
http://bodukai.com/boutique/
The fastest possible shopping experience.
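[For reference, the two request variants being compared look like this as query strings; the field name 'category' and prefix 'A/B/' are just the examples from this thread:]

```python
from urllib.parse import urlencode

common = {"q": "*:*", "facet": "on", "facet.field": "category"}

# Global prefix: constrains faceting on every facet.field in the request.
global_q = urlencode({**common, "facet.prefix": "A/B/"})

# Per-field prefix: should constrain only the 'category' facet.
per_field_q = urlencode({**common, "f.category.facet.prefix": "A/B/"})

print(global_q)
print(per_field_q)
```

Koji's point above is that with the SOLR-64 patch both forms are expected to return the same counts for the hierarchical field.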



RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Mark,

what if piece of code needs 10 contiguous Kb to load a document field? How
locked memory pieces are optimized/moved (putting on hold almost whole
application)?
Lowering heap is _bad_ idea; we will have extremely frequent GC (optimize of
live objects!!!) even if RAM is (theoretically) enough. 

-Fuad


>Fuad, you didn't read the thread right.
> 
> He is not having a problem with OOM. He got the OOM because he lowered
> the heap to try and help GC.
> 
> He normally runs with a heap that can handle his FC.
> 
> Please re-read the thread. You are confusing the thread.
> 
> - Mark
> 


>> GC will frequently happen even if RAM is more than enough: in case if it is
>> heavily sparse... so that have even more RAM!
>> -Fuad




RE: FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
> He is not having a problem with OOM. He got the OOM because he lowered
> the heap to try and help GC.

That is very confusing!!!

Lowering heap helps GC? Someone mentioned it in this thread, but my
viewpoint is completely opposite.

1. Some RAM is needed to_be_reserved for FieldCache (it will be populated
over time, kind of "memory leak" not-well-documented).
2. Some RAM is needed for the rest of application.

And, some pieces of code frequently need contiguous memory (100 bytes? 1000
bytes?), so that GC-optimize will run even if memory is more than
(theoretically) enough.

So that... have even more memory.


I had similar problems with 8Gb; I don't have any problem with 16Gb. And I
never waste time on GC optimization; the server ran without OOM or any
performance issues for almost a year.


>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>> org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)





Re: FW: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Fuad, you didn't read the thread right.

He is not having a problem with OOM. He got the OOM because he lowered
the heap to try and help GC.

He normally runs with a heap that can handle his FC.

Please re-read the thread. You are confusing the thread.

- Mark

Fuad Efendi wrote:
> Guys, thanks for GC discussion; but the root of a problem is FieldCache
> internals.
>
> Not enough RAM for FieldCache will cause unpredictable OOM, and it does not
> depend on GC. How much RAM FieldCache needs in case of 2 different
> values for a Field, 200 bytes each (Unicode), and 100M documents? What if we
> have 100 such non-tokenized fields in a schema?
>
>
> SOLR has an option to warm up caches on startup which might help
> troubleshooting.
>
>
> JRockit JVM has 'realtime' version if you are interested in predictable GC
> (without delaying 'transaction')...
>
> GC will frequently happen even if RAM is more than enough: in case if it is
> heavily sparse... so that have even more RAM!
>
>
>
> -Original Message-
> From: Fuad Efendi [mailto:f...@efendi.ca] 
> Sent: September-25-09 12:17 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr and Garbage Collection
>
>   
>> You are saying that I should give more memory than 12GB?
>> 
>
>
> Yes. Look at this:
>
>   
>>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>> org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)
>
>
>
> It can't find few (!!!) contiguous bytes for .createValue(...)
>
> It can't add (Field Value, Document ID) pair to an array.
>
> GC tuning won't help in this specific case...
>
> Maybe SOLR/Lucene core developers could WARM FieldCache at IndexReader
> opening time, in the future... to have early OOM...
>
>
> Avoiding faceting (and sorting) on such field will only postpone OOM to
> unpredictable date/time...
>
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
>
>
>
>   


-- 
- Mark

http://www.lucidimagination.com





FW: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Guys, thanks for GC discussion; but the root of a problem is FieldCache
internals.

Not enough RAM for FieldCache will cause unpredictable OOM, and it does not
depend on GC. How much RAM FieldCache needs in case of 2 different
values for a Field, 200 bytes each (Unicode), and 100M documents? What if we
have 100 such non-tokenized fields in a schema?
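[Fuad's question can be put into rough numbers. The sketch below assumes the Lucene 2.x-era FieldCache layout for a string field (one 32-bit ord per document plus the distinct values themselves) and a made-up distinct-value count per field, since that is the unknown here:]

```python
def fieldcache_estimate(num_docs, num_distinct, bytes_per_value, num_fields):
    """Very rough FieldCache sizing: per field, an int[] ord per document
    plus storage for the distinct values (assumed StringIndex-style layout)."""
    ords = 4 * num_docs                      # one 32-bit ord per document
    values = num_distinct * bytes_per_value  # the distinct terms themselves
    return num_fields * (ords + values)

# Fuad's numbers: 100M docs, 100 non-tokenized string fields, ~200-byte values;
# the 1M-distinct-values-per-field figure is an invented assumption.
total = fieldcache_estimate(100_000_000, 1_000_000, 200, 100)
print(round(total / 2**30, 1))  # 55.9 -- GiB, dominated by the ord arrays
```

Even with the value count guessed low, the ord arrays alone (4 bytes x 100M docs x 100 fields) come to roughly 40 GiB, which is why populating FieldCache lazily can OOM long after startup.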


SOLR has an option to warm up caches on startup which might help
troubleshooting.


JRockit JVM has 'realtime' version if you are interested in predictable GC
(without delaying 'transaction')...

GC will frequently happen even if RAM is more than enough: in case if it is
heavily sparse... so that have even more RAM!



-Original Message-
From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: September-25-09 12:17 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr and Garbage Collection

> You are saying that I should give more memory than 12GB?


Yes. Look at this:

> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> > org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)



It can't find few (!!!) contiguous bytes for .createValue(...)

It can't add (Field Value, Document ID) pair to an array.

GC tuning won't help in this specific case...

Maybe SOLR/Lucene core developers could WARM FieldCache at IndexReader
opening time, in the future... to have early OOM...


Avoiding faceting (and sorting) on such field will only postpone OOM to
unpredictable date/time...


-Fuad
http://www.linkedin.com/in/liferay







Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
This all applies to having more than one processor though - if you have
one processor, than non concurrent can also make sense.

But especially with the young space, you want concurrency - with upto
98% of objects being short lived, and multiple threads generally
creating new objects, it's a huge boon to collect the young space
concurrently.

Mark Miller wrote:
> Walter Underwood wrote:
>   
>> For batch-oriented computing, like Hadoop, the most efficient GC is probably
>> a non-concurrent, non-generational GC. 
>> 
> Okay - for batch we somewhat agree I guess - if you can stand any length
> of pausing, non concurrent can be nice, because you don't pay for thread
> sync communication. Only with a small heap size though (less than 100MB
> is what I've seen). You would pause the batch job while GC takes place.
> If you have 8 processors, and you are pausing all of them to collect a
> large heap using only 1 processor, that doesn't make much sense to me.
> The thread communication pain will be far outweighed by using more
> processors to do the collection faster, and not "stop the world" for
> your batch job so long. Stopping your application dead in its tracks,
> and then only using one of the available processors to collect a large
> heap, while the rest sit idle, doesn't make much sense.
>
> I also don't agree it ever really makes sense not to do generational
> collection. What is your argument here? Generational collection is
> **way** more efficient for short lived objects, which tend to be up to
> 98% of the objects in most applications. The only way I see that making
> sense is if you have almost no short lived objects (which occurs in
> what, .0001% of apps if at all?). The Sun JVM doesn't even offer a non
> generational approach anymore. It's just standard GC practice.
>   
>> I doubt that there are many
>> batch-oriented applications of Solr, though.
>>
>> The rest of the advice is intended to be general and it sounds like we agree
>> about sizing. If the nursery is not big enough, the tenured space will be
>> used for allocations that have a short lifetime and that will increase the
>> length and/or frequency of major collections.
>>   
>> 
> Yes - I wasn't arguing with every point - I was picking and choosing :)
> After the heap size, the size of the young generation is the most
> important factor.
>   
>> Cache evictions are the interesting part, because they cause a constant rate
>> of tenured space garbage. In most many servers, you can get a big enough
>> nursery that major collections are very rare. That won't happen in Solr
>> because of cache evictions.
>>
>> The IBM JVM is excellent. Their concurrent generational GC policy is
>> "gencon".
>>   
>> 
> Yeah, I actually know very little about the IBM JVM, so I wasn't really
> commenting. But from the info I gleaned here and on a couple quick web
>> searches, I'm not too impressed by its GC.
>   
>> wunder
>>
>> -Original Message-
>> From: Mark Miller [mailto:markrmil...@gmail.com] 
>> Sent: Friday, September 25, 2009 10:31 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr and Garbage Collection
>>
>> My bad - later, it looks as if you're giving general advice, and that's
>> what I took issue with.
>>
>> Any Collector that is not doing generational collection is essentially
>> from the dark ages and shouldn't be used.
>>
>> Any Collector that doesn't have concurrent options, unless possibly you're
>> running a tiny app (under 100MB of RAM), or only have a single CPU, is
>> also dark ages, and not fit for a server environment.
>>
>> I haven't kept up with IBM's JVM, but it sounds like they are well behind
>> Sun in GC then.
>>
>> - Mark
>>
>> Walter Underwood wrote:
>>   
>> 
>>> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
>>> pause" collector is only in the Sun JVM.
>>>
>>> I just found this excellent article about the various IBM GC options for a
>>> Lucene application with a 100GB heap:
>>>
>>>
>>> 
>>>   
>> http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large
>>   
>> 
>>> _h.html
>>>
>>> wunder
>>>
>>> -Original Message-
>>> From: Mark Miller [mailto:markrmil...@gmail.com] 
>>> Sent: Friday, September 25, 2009 10:03 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr and Garbage Collection
>>>
>>> Walter Underwood wrote:
>>>   
>>> 
>>>   
 30ms is not better or worse than 1s until you look at the service
 requirements. For many applications, it is worth dedicating 10% of your
 processing time to GC if that makes the worst-case pause short.

 On the other hand, my experience with the IBM JVM was that the maximum
 
   
 
>>> query
>>>   
>>> 
>>>   
 rate was 2-3X better with the concurrent generational GC compared to any
 
   
 
>>> of
>>>   
>>> 
>>>   
 their other GC algorithms, so we got the best throughput along with th

Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Walter Underwood wrote:
> For batch-oriented computing, like Hadoop, the most efficient GC is probably
> a non-concurrent, non-generational GC. 
Okay - for batch we somewhat agree I guess - if you can stand any length
of pausing, non concurrent can be nice, because you don't pay for thread
sync communication. Only with a small heap size though (less than 100MB
is what I've seen). You would pause the batch job while GC takes place.
If you have 8 processors, and you are pausing all of them to collect a
large heap using only 1 processor, that doesn't make much sense to me.
The thread communication pain will be far outweighed by using more
processors to do the collection faster, and not "stop the world" for
your batch job so long. Stopping your application dead in its tracks,
and then only using one of the available processors to collect a large
heap, while the rest sit idle, doesn't make much sense.

I also don't agree it ever really makes sense not to do generational
collection. What is your argument here? Generational collection is
**way** more efficient for short lived objects, which tend to be up to
98% of the objects in most applications. The only way I see that making
sense is if you have almost no short lived objects (which occurs in
what, .0001% of apps if at all?). The Sun JVM doesn't even offer a non
generational approach anymore. It's just standard GC practice.
> I doubt that there are many
> batch-oriented applications of Solr, though.
>
> The rest of the advice is intended to be general and it sounds like we agree
> about sizing. If the nursery is not big enough, the tenured space will be
> used for allocations that have a short lifetime and that will increase the
> length and/or frequency of major collections.
>   
Yes - I wasn't arguing with every point - I was picking and choosing :)
After the heap size, the size of the young generation is the most
important factor.
> Cache evictions are the interesting part, because they cause a constant rate
> of tenured space garbage. In most many servers, you can get a big enough
> nursery that major collections are very rare. That won't happen in Solr
> because of cache evictions.
>
> The IBM JVM is excellent. Their concurrent generational GC policy is
> "gencon".
>   
Yeah, I actually know very little about the IBM JVM, so I wasn't really
commenting. But from the info I gleaned here and on a couple quick web
searches, I'm not too impressed by its GC.
> wunder
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com] 
> Sent: Friday, September 25, 2009 10:31 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and Garbage Collection
>
> My bad - later, it looks as if your giving general advice, and thats
> what I took issue with.
>
> Any Collector that is not doing generational collection is essentially
> from the dark ages and shouldn't be used.
>
> Any Collector that doesn't have concurrent options, unless possibly your
> running a tiny app (under 100MB of RAM), or only have a single CPU, is
> also dark ages, and not fit for a server environement.
>
> I havn't kept up with IBM's JVM, but it sounds like they are well behind
> Sun in GC then.
>
> - Mark
>
> Walter Underwood wrote:
>   
>> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
>> pause" collector is only in the Sun JVM.
>>
>> I just found this excellent article about the various IBM GC options for a
>> Lucene application with a 100GB heap:
>>
>>
>> 
>> http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html
>>
>> wunder
>>
>> -Original Message-
>> From: Mark Miller [mailto:markrmil...@gmail.com] 
>> Sent: Friday, September 25, 2009 10:03 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr and Garbage Collection
>>
>> Walter Underwood wrote:
>>   
>> 
>>> 30ms is not better or worse than 1s until you look at the service
>>> requirements. For many applications, it is worth dedicating 10% of your
>>> processing time to GC if that makes the worst-case pause short.
>>>
>>> On the other hand, my experience with the IBM JVM was that the maximum
>>> 
>>>   
>> query
>>   
>> 
>>> rate was 2-3X better with the concurrent generational GC compared to any
>>> 
>>>   
>> of
>>   
>> 
>>> their other GC algorithms, so we got the best throughput along with the
>>> shortest pauses.
>>>   
>>> 
>>>   
>> With which collector? Since the very early JVM's, all GC is generational.
>> Most of the collectors (other than the Serial Collector) also work
>> concurrently.
>> By default, they are concurrent on different generations, but you can
>> add concurrency
>> to the "other" generation with each now too.
>>   
>> 
>>> Solr garbage generation (for queries) seems to have two major components:
>>> per-request garbage and cache evictions. With a generational collector,
>>> these two are handled by separate parts of the collector.
>>> 
>>>   
>> Different parts of the 

Solr + Jboss + Custom Transformers

2009-09-25 Thread Papiya Misra

Hi

I am trying to use a custom transformer that extends
org.apache.solr.handler.dataimport.Transformer.

I have the CustomTransformer.jar and DataImportHandler.jar in
JBOSS/server/default/lib. I have the solr.war (as is from the distro) in
the JBOSS/server/default/deploy.

org.apache.solr.handler.dataimport.EntityProcessorWrapper (line 110)
returns false for the following code
 clazz.newInstance() instanceof Transformer


This happens because the CustomTransformer uses the Transformer from a
different ClassLoader than the Solr web application.
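The failure Papiya describes is easy to reproduce in miniature. The Python sketch below is an analogy, not Solr code: loading the same source file as two separate module objects plays the role of two Java class loaders each defining `Transformer`, and the `isinstance` check then fails exactly the way `clazz.newInstance() instanceof Transformer` does.

```python
import importlib.util
import pathlib
import tempfile

# The same source, "defined" twice by two different loaders.
src = "class Transformer:\n    pass\n"
path = pathlib.Path(tempfile.mkdtemp()) / "t.py"
path.write_text(src)

def load_as(name):
    # Each call builds an independent module object from the same file,
    # mimicking a separate class loader defining its own copy of the class.
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

m1, m2 = load_as("loader_a"), load_as("loader_b")
obj = m2.Transformer()
# Same code, different "loader": the identity check fails.
print(isinstance(obj, m1.Transformer))   # False
```

In Java terms, the fix is to make sure `DataImportHandler.jar` and the custom transformer see `Transformer` through the same class loader (e.g. by placing both jars where the Solr webapp's loader finds them) rather than splitting them between the container's lib directory and the war.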

I could use the source code to create a solr.war that includes the
CustomTransformer class. Is there any other option - one that preferably
does not involve re-packaging solr.war?

Thanks
Papiya

Pink OTC Markets Inc. provides the leading inter-dealer quotation and trading 
system in the over-the-counter (OTC) securities market.   We create innovative 
technology and data solutions to efficiently connect market participants, 
improve price discovery, increase issuer disclosure, and better inform 
investors.   Our marketplace, comprised of the issuer-listed OTCQX and 
broker-quoted   Pink Sheets, is the third largest U.S. equity trading venue for 
company shares.

This document contains confidential information of Pink OTC Markets and is only 
intended for the recipient.   Do not copy, reproduce (electronically or 
otherwise), or disclose without the prior written consent of Pink OTC Markets.  
If you receive this message in error, please destroy all copies in your 
possession (electronically or otherwise) and contact the sender above.


RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
For batch-oriented computing, like Hadoop, the most efficient GC is probably
a non-concurrent, non-generational GC. I doubt that there are many
batch-oriented applications of Solr, though.

The rest of the advice is intended to be general and it sounds like we agree
about sizing. If the nursery is not big enough, the tenured space will be
used for allocations that have a short lifetime and that will increase the
length and/or frequency of major collections.

Cache evictions are the interesting part, because they cause a constant rate
of tenured space garbage. In most servers, you can get a big enough
nursery that major collections are very rare. That won't happen in Solr
because of cache evictions.

The IBM JVM is excellent. Their concurrent generational GC policy is
"gencon".

wunder

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, September 25, 2009 10:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

My bad - later, it looks as if you're giving general advice, and that's
what I took issue with.

Any Collector that is not doing generational collection is essentially
from the dark ages and shouldn't be used.

Any Collector that doesn't have concurrent options, unless possibly you're
running a tiny app (under 100MB of RAM), or only have a single CPU, is
also dark ages, and not fit for a server environment.

I haven't kept up with IBM's JVM, but it sounds like they are well behind
Sun in GC then.

- Mark

Walter Underwood wrote:
> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
> pause" collector is only in the Sun JVM.
>
> I just found this excellent article about the various IBM GC options for a
> Lucene application with a 100GB heap:
>
>
> http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html
>
> wunder
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com] 
> Sent: Friday, September 25, 2009 10:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and Garbage Collection
>
> Walter Underwood wrote:
>   
>> 30ms is not better or worse than 1s until you look at the service
>> requirements. For many applications, it is worth dedicating 10% of your
>> processing time to GC if that makes the worst-case pause short.
>>
>> On the other hand, my experience with the IBM JVM was that the maximum
>> 
> query
>   
>> rate was 2-3X better with the concurrent generational GC compared to any
>> 
> of
>   
>> their other GC algorithms, so we got the best throughput along with the
>> shortest pauses.
>>   
>> 
> With which collector? Since the very early JVM's, all GC is generational.
> Most of the collectors (other than the Serial Collector) also work
> concurrently.
> By default, they are concurrent on different generations, but you can
> add concurrency
> to the "other" generation with each now too.
>   
>> Solr garbage generation (for queries) seems to have two major components:
>> per-request garbage and cache evictions. With a generational collector,
>> these two are handled by separate parts of the collector.
>> 
> Different parts of the collector? Its a different collector depending on
> the generation.
> The young generation is collected with a copy collector. This is because
> almost all the objects
> in the young generation are likely dead, and a copy collector only needs
> to visit live objects. So
> its very efficient. The tenured generation uses something more along the
> lines of mark and sweep or mark
> and compact.
>   
>>  Per-request
>> garbage should completely fit in the short-term heap (nursery), so that
it
>> can be collected rapidly and returned to use for further requests. If the
>> nursery is too small, the per-request allocations will be made in tenured
>> space and sit there until the next major GC. Cache evictions are almost
>> always in long-term storage (tenured space) because an LRU algorithm
>> guarantees that the garbage will be old.
>>
>> Check the growth rate of tenured space (under constant load, of course)
>> while increasing the size of the nursery. That rate should drop when the
>> nursery gets big enough, then not drop much further as it is increased
>> 
> more.
>   
>> After that, reduce the size of tenured space until major GCs start
>> 
> happening
>   
>> "too often" (a judgment call). A bigger tenured space means longer major
>> 
> GCs
>   
>> and thus longer pauses, so you don't want it oversized by too much.
>>   
>> 
> With the concurrent low pause collector, the goal is to avoid "major"
> collections,
> by collecting *before* the tenured space is filled. If you are
> getting "major" collections,
> you need to tune your settings - the whole point of that collector is to
> avoid "major"
> collections, and do almost all of the work while your application is not
> paused. There are
> still 2 brief pauses during the collection, but they should not be
> significant at all.
>   
>> Al

8 for 1.4

2009-09-25 Thread Grant Ingersoll

Y'all,

We're down to 8 open issues:  
https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310230&versionId=12313351&showOpenIssuesOnly=true

2 are packaging related, one is dependent on the official 2.9 release  
(so should be taken care of today or tomorrow I suspect) and then we  
have a few others.


The somewhat major ones are S-1458, S-1294 (more on this in a  
mo') and S-1449.


On S-1294, the SolrJS patch, I yet again have concerns about even  
including this, given the lack of activity (from Matthias, the  
original author and others) and the fact that some in the Drupal  
community have already forked this to fix the various bugs in it  
instead of just submitting patches.  While I really like the idea of  
this library (jQuery is awesome), I have yet to see interest in the  
community to maintain it (unless you count someone forking it and  
fixing the bugs in the fork as maintenance) and I'll be upfront in  
admitting I have neither the time nor the patience to debug Javascript  
across the gazillions of browsers out there (I don't even have IE on  
my machine unless you count firing up a VM w/ XP on it) in the wild.   
Given what I know of most of the other committers here, I suspect that  
is true for others too.  At a minimum, I think S-1294 should be pushed  
to 1.5.  Next up, I think we consider pulling SolrJS from the release,  
but keeping it in trunk and officially releasing it with either 1.5 or  
1.4.1, assuming it's gotten some love in the meantime.  If by then it  
has no love, I vote we remove it and let the fork maintain it and  
point people there.


-Grant


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
My bad - later, it looks as if you're giving general advice, and that's
what I took issue with.

Any Collector that is not doing generational collection is essentially
from the dark ages and shouldn't be used.

Any Collector that doesn't have concurrent options, unless possibly you're
running a tiny app (under 100MB of RAM), or only have a single CPU, is
also dark ages, and not fit for a server environment.

I haven't kept up with IBM's JVM, but it sounds like they are well behind
Sun in GC then.

- Mark

Walter Underwood wrote:
> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
> pause" collector is only in the Sun JVM.
>
> I just found this excellent article about the various IBM GC options for a
> Lucene application with a 100GB heap:
>
> http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html
>
> wunder
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com] 
> Sent: Friday, September 25, 2009 10:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and Garbage Collection
>
> Walter Underwood wrote:
>   
>> 30ms is not better or worse than 1s until you look at the service
>> requirements. For many applications, it is worth dedicating 10% of your
>> processing time to GC if that makes the worst-case pause short.
>>
>> On the other hand, my experience with the IBM JVM was that the maximum
>> 
> query
>   
>> rate was 2-3X better with the concurrent generational GC compared to any
>> 
> of
>   
>> their other GC algorithms, so we got the best throughput along with the
>> shortest pauses.
>>   
>> 
> With which collector? Since the very early JVM's, all GC is generational.
> Most of the collectors (other than the Serial Collector) also work
> concurrently.
> By default, they are concurrent on different generations, but you can
> add concurrency
> to the "other" generation with each now too.
>   
>> Solr garbage generation (for queries) seems to have two major components:
>> per-request garbage and cache evictions. With a generational collector,
>> these two are handled by separate parts of the collector.
>> 
> Different parts of the collector? Its a different collector depending on
> the generation.
> The young generation is collected with a copy collector. This is because
> almost all the objects
> in the young generation are likely dead, and a copy collector only needs
> to visit live objects. So
> its very efficient. The tenured generation uses something more along the
> lines of mark and sweep or mark
> and compact.
>   
>>  Per-request
>> garbage should completely fit in the short-term heap (nursery), so that it
>> can be collected rapidly and returned to use for further requests. If the
>> nursery is too small, the per-request allocations will be made in tenured
>> space and sit there until the next major GC. Cache evictions are almost
>> always in long-term storage (tenured space) because an LRU algorithm
>> guarantees that the garbage will be old.
>>
>> Check the growth rate of tenured space (under constant load, of course)
>> while increasing the size of the nursery. That rate should drop when the
>> nursery gets big enough, then not drop much further as it is increased
>> 
> more.
>   
>> After that, reduce the size of tenured space until major GCs start
>> 
> happening
>   
>> "too often" (a judgment call). A bigger tenured space means longer major
>> 
> GCs
>   
>> and thus longer pauses, so you don't want it oversized by too much.
>>   
>> 
> With the concurrent low pause collector, the goal is to avoid "major"
> collections,
> by collecting *before* the tenured space is filled. If you are
> getting "major" collections,
> you need to tune your settings - the whole point of that collector is to
> avoid "major"
> collections, and do almost all of the work while your application is not
> paused. There are
> still 2 brief pauses during the collection, but they should not be
> significant at all.
>   
>> Also check the hit rates of your caches. If the hit rate is low, say 20%
>> 
> or
>   
>> less, make that cache much bigger or set it to zero. Either one will
>> 
> reduce
>   
>> the number of cache evictions. If you have an HTTP cache in front of Solr,
>> zero may be the right choice, since the HTTP cache is cherry-picking the
>> easily cacheable requests.
>>
>> Note that a commit nearly doubles the memory required, because you have
>> 
> two
>   
>> live Searcher objects with all their caches. Make sure you have headroom
>> 
> for
>   
>> a commit.
>>
>> If you want to test the tenured space usage, you must test with real world
>> queries. Those are the only way to get accurate cache eviction rates.
>>
>> wunder
>>
>> -Original Message-
>> From: Jonathan Ariel [mailto:ionat...@gmail.com] 
>> Sent: Friday, September 25, 2009 9:34 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr and Garbage Collection
>>
>> BTW why making them equal will lower the 

Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Ok. I will try with the "concurrent low pause" collector and let you know
the results.
On Fri, Sep 25, 2009 at 2:23 PM, Walter Underwood wrote:

> As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
> pause" collector is only in the Sun JVM.
>
> I just found this excellent article about the various IBM GC options for a
> Lucene application with a 100GB heap:
>
>
> http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html
>
> wunder
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Friday, September 25, 2009 10:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and Garbage Collection
>
> Walter Underwood wrote:
> > 30ms is not better or worse than 1s until you look at the service
> > requirements. For many applications, it is worth dedicating 10% of your
> > processing time to GC if that makes the worst-case pause short.
> >
> > On the other hand, my experience with the IBM JVM was that the maximum
> query
> > rate was 2-3X better with the concurrent generational GC compared to any
> of
> > their other GC algorithms, so we got the best throughput along with the
> > shortest pauses.
> >
> With which collector? Since the very early JVM's, all GC is generational.
> Most of the collectors (other than the Serial Collector) also work
> concurrently.
> By default, they are concurrent on different generations, but you can
> add concurrency
> to the "other" generation with each now too.
> > Solr garbage generation (for queries) seems to have two major components:
> > per-request garbage and cache evictions. With a generational collector,
> > these two are handled by separate parts of the collector.
> Different parts of the collector? Its a different collector depending on
> the generation.
> The young generation is collected with a copy collector. This is because
> almost all the objects
> in the young generation are likely dead, and a copy collector only needs
> to visit live objects. So
> its very efficient. The tenured generation uses something more along the
> lines of mark and sweep or mark
> and compact.
> >  Per-request
> > garbage should completely fit in the short-term heap (nursery), so that
> it
> > can be collected rapidly and returned to use for further requests. If the
> > nursery is too small, the per-request allocations will be made in tenured
> > space and sit there until the next major GC. Cache evictions are almost
> > always in long-term storage (tenured space) because an LRU algorithm
> > guarantees that the garbage will be old.
> >
> > Check the growth rate of tenured space (under constant load, of course)
> > while increasing the size of the nursery. That rate should drop when the
> > nursery gets big enough, then not drop much further as it is increased
> more.
> >
> > After that, reduce the size of tenured space until major GCs start
> happening
> > "too often" (a judgment call). A bigger tenured space means longer major
> GCs
> > and thus longer pauses, so you don't want it oversized by too much.
> >
> With the concurrent low pause collector, the goal is to avoid "major"
> collections,
> by collecting *before* the tenured space is filled. If you are
> getting "major" collections,
> you need to tune your settings - the whole point of that collector is to
> avoid "major"
> collections, and do almost all of the work while your application is not
> paused. There are
> still 2 brief pauses during the collection, but they should not be
> significant at all.
> > Also check the hit rates of your caches. If the hit rate is low, say 20%
> or
> > less, make that cache much bigger or set it to zero. Either one will
> reduce
> > the number of cache evictions. If you have an HTTP cache in front of
> Solr,
> > zero may be the right choice, since the HTTP cache is cherry-picking the
> > easily cacheable requests.
> >
> > Note that a commit nearly doubles the memory required, because you have
> two
> > live Searcher objects with all their caches. Make sure you have headroom
> for
> > a commit.
> >
> > If you want to test the tenured space usage, you must test with real
> world
> > queries. Those are the only way to get accurate cache eviction rates.
> >
> > wunder
> >
> > -Original Message-
> > From: Jonathan Ariel [mailto:ionat...@gmail.com]
> > Sent: Friday, September 25, 2009 9:34 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr and Garbage Collection
> >
> > BTW why making them equal will lower the frequency of GC?
> >
> > On 9/25/09, Fuad Efendi  wrote:
> >
> >>> Bigger heaps lead to bigger GC pauses in general.
> >>>
> >> Opposite viewpoint:
> >> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
> >>
> > once-per-second.
> >
> >> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
> >>
> >> Use -server option.
> >>
> >> -server option of JVM is 'native CPU code', I remember WebLogic 7
> console
> >> with SUN JVM 1.3 not showing any GC (just horizontal line).
> 

RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
As I said, I was using the IBM JVM, not the Sun JVM. The "concurrent low
pause" collector is only in the Sun JVM.

I just found this excellent article about the various IBM GC options for a
Lucene application with a 100GB heap:

http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html

wunder

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, September 25, 2009 10:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

Walter Underwood wrote:
> 30ms is not better or worse than 1s until you look at the service
> requirements. For many applications, it is worth dedicating 10% of your
> processing time to GC if that makes the worst-case pause short.
>
> On the other hand, my experience with the IBM JVM was that the maximum
query
> rate was 2-3X better with the concurrent generational GC compared to any
of
> their other GC algorithms, so we got the best throughput along with the
> shortest pauses.
>   
With which collector? Since the very early JVMs, all GC is generational.
Most of the collectors (other than the Serial Collector) also work
concurrently. By default, they are concurrent on different generations,
but you can now add concurrency to the "other" generation with each too.
> Solr garbage generation (for queries) seems to have two major components:
> per-request garbage and cache evictions. With a generational collector,
> these two are handled by separate parts of the collector.
Different parts of the collector? It's a different collector depending on
the generation. The young generation is collected with a copy collector,
because almost all the objects in the young generation are likely dead,
and a copy collector only needs to visit live objects, so it's very
efficient. The tenured generation uses something more along the lines of
mark-and-sweep or mark-and-compact.
>  Per-request
> garbage should completely fit in the short-term heap (nursery), so that it
> can be collected rapidly and returned to use for further requests. If the
> nursery is too small, the per-request allocations will be made in tenured
> space and sit there until the next major GC. Cache evictions are almost
> always in long-term storage (tenured space) because an LRU algorithm
> guarantees that the garbage will be old.
>
> Check the growth rate of tenured space (under constant load, of course)
> while increasing the size of the nursery. That rate should drop when the
> nursery gets big enough, then not drop much further as it is increased
more.
>
> After that, reduce the size of tenured space until major GCs start
happening
> "too often" (a judgment call). A bigger tenured space means longer major
GCs
> and thus longer pauses, so you don't want it oversized by too much.
>   
With the concurrent low pause collector, the goal is to avoid "major"
collections, by collecting *before* the tenured space is filled. If you
are getting "major" collections, you need to tune your settings - the
whole point of that collector is to avoid "major" collections, and to do
almost all of the work while your application is not paused. There are
still 2 brief pauses during the collection, but they should not be
significant at all.
> Also check the hit rates of your caches. If the hit rate is low, say 20%
or
> less, make that cache much bigger or set it to zero. Either one will
reduce
> the number of cache evictions. If you have an HTTP cache in front of Solr,
> zero may be the right choice, since the HTTP cache is cherry-picking the
> easily cacheable requests.
>
> Note that a commit nearly doubles the memory required, because you have
two
> live Searcher objects with all their caches. Make sure you have headroom
for
> a commit.
>
> If you want to test the tenured space usage, you must test with real world
> queries. Those are the only way to get accurate cache eviction rates.
>
> wunder
>
> -Original Message-
> From: Jonathan Ariel [mailto:ionat...@gmail.com] 
> Sent: Friday, September 25, 2009 9:34 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and Garbage Collection
>
> BTW why making them equal will lower the frequency of GC?
>
> On 9/25/09, Fuad Efendi  wrote:
>   
>>> Bigger heaps lead to bigger GC pauses in general.
>>>   
>> Opposite viewpoint:
>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
>> 
> once-per-second.
>   
>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>>
>> Use -server option.
>>
>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>> with SUN JVM 1.3 not showing any GC (just horizontal line).
>>
>> -Fuad
>> http://www.linkedin.com/in/liferay
>>
>>
>>
>>
>> 
>
>
>   


-- 
- Mark

http://www.lucidimagination.com






boost function for date as unix stamp

2009-09-25 Thread Joe Calderon
hello *, i read on the wiki about using recip(rord(...)...) to boost
recent documents with a date field, does anyone have a good function
for doing something similar with unix timestamps?

if not, is there a lot of overhead related to counting the number of
distinct values for rord() ?


thx much

--joe
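For readers landing on this thread: the wiki's reciprocal boost has the shape recip(x,m,a,b) = a/(m*x + b), and the usual trick for timestamps is to feed it the document's age rather than the raw value. The sketch below only illustrates that shape in Python; the field name, half-life, and constants are assumptions for illustration, not from Joe's setup.

```python
import time

def recip(x, m, a, b):
    """The shape of Solr's recip() function query: a / (m*x + b)."""
    return a / (m * x + b)

def age_boost(doc_ts, now=None, half_life_days=30.0):
    """Boost that is 1.0 for a brand-new document and decays toward 0
    as the unix timestamp ages; it reaches 0.5 at half_life_days."""
    now = time.time() if now is None else now
    age_seconds = max(now - doc_ts, 0.0)
    m = 1.0 / (half_life_days * 86400.0)  # m*age == 1 exactly at the half-life
    return recip(age_seconds, m, 1.0, 1.0)

now = 1_253_900_000  # ~Sep 2009, roughly when this thread ran
print(age_boost(now, now=now))               # fresh doc -> 1.0
print(age_boost(now - 30 * 86400, now=now))  # 30-day-old doc -> 0.5
```

Because this works on the timestamp's age directly, it avoids rord() entirely, and with it the overhead of building the ordinal array over all distinct values.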


Re: download pre-release nightly solr 1.4

2009-09-25 Thread Mark Miller
michael8 wrote:
>
> markrmiller wrote:
>   
>> michael8 wrote:
>> 
>>> Hi,
>>>
>>> I know Solr 1.4 is going to be released any day now pending Lucene 2.9
>>> release.  Is there anywhere where one can download a pre-released nighly
>>> build of Solr 1.4 just for getting familiar with new features (e.g. field
>>> collapsing)?
>>>
>>> Thanks,
>>> Michael
>>>   
>>>   
>> You can download nightlies
>> here: http://people.apache.org/builds/lucene/solr/nightly/
>>
>> field collapsing won't be in 1.4 though. You have to build from svn
>> after applying the patch for that.
>>
>> -- 
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>>
>> 
>
> Thanks for the info Mark.  If field collapsing is a patch, can I apply the
> patch against 1.3 then?  Thanks again.
>
> Michael
>
>   
Not likely - it has to apply to the current code. If you can find an old
patch that works with 1.3 (not sure when the patches for that started),
it's possible.
But you would be using a very old patch (and I'm not sure there is one
that applies to 1.3 trunk either, but you could check).

-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Walter Underwood wrote:
> 30ms is not better or worse than 1s until you look at the service
> requirements. For many applications, it is worth dedicating 10% of your
> processing time to GC if that makes the worst-case pause short.
>
> On the other hand, my experience with the IBM JVM was that the maximum query
> rate was 2-3X better with the concurrent generational GC compared to any of
> their other GC algorithms, so we got the best throughput along with the
> shortest pauses.
>   
With which collector? Since the very early JVMs, all GC is generational.
Most of the collectors (other than the Serial Collector) also work
concurrently. By default, they are concurrent on different generations,
but you can now add concurrency to the "other" generation with each too.
> Solr garbage generation (for queries) seems to have two major components:
> per-request garbage and cache evictions. With a generational collector,
> these two are handled by separate parts of the collector.
Different parts of the collector? It's a different collector depending on
the generation. The young generation is collected with a copy collector,
because almost all the objects in the young generation are likely dead,
and a copy collector only needs to visit live objects, so it's very
efficient. The tenured generation uses something more along the lines of
mark-and-sweep or mark-and-compact.
>  Per-request
> garbage should completely fit in the short-term heap (nursery), so that it
> can be collected rapidly and returned to use for further requests. If the
> nursery is too small, the per-request allocations will be made in tenured
> space and sit there until the next major GC. Cache evictions are almost
> always in long-term storage (tenured space) because an LRU algorithm
> guarantees that the garbage will be old.
>
> Check the growth rate of tenured space (under constant load, of course)
> while increasing the size of the nursery. That rate should drop when the
> nursery gets big enough, then not drop much further as it is increased more.
>
> After that, reduce the size of tenured space until major GCs start happening
> "too often" (a judgment call). A bigger tenured space means longer major GCs
> and thus longer pauses, so you don't want it oversized by too much.
>   
With the concurrent low pause collector, the goal is to avoid "major"
collections, by collecting *before* the tenured space is filled. If you
are getting "major" collections, you need to tune your settings - the
whole point of that collector is to avoid "major" collections, and to do
almost all of the work while your application is not paused. There are
still 2 brief pauses during the collection, but they should not be
significant at all.
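On the Sun JVM, the point about collecting *before* tenured space fills is controlled by the flags sketched below. The flags themselves are real HotSpot options, but the 70% threshold is an illustrative value, not something from this thread; the right number depends on how fast your application promotes garbage.

```shell
# Start the concurrent cycle when tenured space is ~70% full, well
# before it fills, so a stop-the-world "major" collection should never
# be needed. The second flag stops the JVM from second-guessing the
# threshold with its own heuristics.
java -server -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar start.jar
```

If the GC log still shows "concurrent mode failure" entries, the threshold is too high (or the heap too small) for the promotion rate.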
> Also check the hit rates of your caches. If the hit rate is low, say 20% or
> less, make that cache much bigger or set it to zero. Either one will reduce
> the number of cache evictions. If you have an HTTP cache in front of Solr,
> zero may be the right choice, since the HTTP cache is cherry-picking the
> easily cacheable requests.
>
> Note that a commit nearly doubles the memory required, because you have two
> live Searcher objects with all their caches. Make sure you have headroom for
> a commit.
>
> If you want to test the tenured space usage, you must test with real world
> queries. Those are the only way to get accurate cache eviction rates.
>
> wunder
>
> -Original Message-
> From: Jonathan Ariel [mailto:ionat...@gmail.com] 
> Sent: Friday, September 25, 2009 9:34 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and Garbage Collection
>
> BTW why making them equal will lower the frequency of GC?
>
> On 9/25/09, Fuad Efendi  wrote:
>   
>>> Bigger heaps lead to bigger GC pauses in general.
>>>   
>> Opposite viewpoint:
>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
>> 
> once-per-second.
>   
>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>>
>> Use -server option.
>>
>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>> with SUN JVM 1.3 not showing any GC (just horizontal line).
>>
>> -Fuad
>> http://www.linkedin.com/in/liferay
>>
>>
>>
>>
>> 
>
>
>   


-- 
- Mark

http://www.lucidimagination.com





Re: download pre-release nightly solr 1.4

2009-09-25 Thread michael8



markrmiller wrote:
> 
> michael8 wrote:
>> Hi,
>>
>> I know Solr 1.4 is going to be released any day now pending Lucene 2.9
>> release.  Is there anywhere where one can download a pre-released nighly
>> build of Solr 1.4 just for getting familiar with new features (e.g. field
>> collapsing)?
>>
>> Thanks,
>> Michael
>>   
> You can download nightlies
> here: http://people.apache.org/builds/lucene/solr/nightly/
> 
> field collapsing won't be in 1.4 though. You have to build from svn
> after applying the patch for that.
> 
> -- 
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> 
> 

Thanks for the info Mark.  If field collapsing is a patch, can I apply the
patch against 1.3 then?  Thanks again.

Michael

-- 
View this message in context: 
http://www.nabble.com/download-pre-release-nightly-solr-1.4-tp25590281p25615553.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Solr and Garbage Collection

2009-09-25 Thread Walter Underwood
30ms is not better or worse than 1s until you look at the service
requirements. For many applications, it is worth dedicating 10% of your
processing time to GC if that makes the worst-case pause short.

On the other hand, my experience with the IBM JVM was that the maximum query
rate was 2-3X better with the concurrent generational GC compared to any of
their other GC algorithms, so we got the best throughput along with the
shortest pauses.

Solr garbage generation (for queries) seems to have two major components:
per-request garbage and cache evictions. With a generational collector,
these two are handled by separate parts of the collector. Per-request
garbage should completely fit in the short-term heap (nursery), so that it
can be collected rapidly and returned to use for further requests. If the
nursery is too small, the per-request allocations will be made in tenured
space and sit there until the next major GC. Cache evictions are almost
always in long-term storage (tenured space) because an LRU algorithm
guarantees that the garbage will be old.

Check the growth rate of tenured space (under constant load, of course)
while increasing the size of the nursery. That rate should drop when the
nursery gets big enough, then not drop much further as it is increased more.

After that, reduce the size of tenured space until major GCs start happening
"too often" (a judgment call). A bigger tenured space means longer major GCs
and thus longer pauses, so you don't want it oversized by too much.

Also check the hit rates of your caches. If the hit rate is low, say 20% or
less, make that cache much bigger or set it to zero. Either one will reduce
the number of cache evictions. If you have an HTTP cache in front of Solr,
zero may be the right choice, since the HTTP cache is cherry-picking the
easily cacheable requests.

Note that a commit nearly doubles the memory required, because you have two
live Searcher objects with all their caches. Make sure you have headroom for
a commit.

If you want to test the tenured space usage, you must test with real world
queries. Those are the only way to get accurate cache eviction rates.

wunder

-Original Message-
From: Jonathan Ariel [mailto:ionat...@gmail.com] 
Sent: Friday, September 25, 2009 9:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

BTW why making them equal will lower the frequency of GC?

On 9/25/09, Fuad Efendi  wrote:
>> Bigger heaps lead to bigger GC pauses in general.
>
> Opposite viewpoint:
> 1sec GC happening once an hour is MUCH BETTER than 30ms GC
> once-per-second.
>
> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>
> Use -server option.
>
> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
> with SUN JVM 1.3 not showing any GC (just horizontal line).
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
>
>




Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
>-server option of JVM is 'native CPU code', I remember WebLogic 7 console
>with SUN JVM 1.3 not showing any GC (just horizontal line).

Not sure what that is all about either. -server and -client are just two
different versions of hotspot.
The -server version is optimized for long running applications - it
starts slower, and over time, it learns
about your app and makes good throughput optimizations.

The -client hotspot version warms up quicker, and concentrates
more on response time than throughput.
Better for desktop apps. -server is better for long-lived server apps.
Generally.

Mark Miller wrote:
> It won't really - it will just keep the JVM from wasting time resizing
> the heap on you. Since you know you need so much RAM anyway, no reason
> not to just pin it at what you need.
> Not going to help you much with GC though.
>
> Jonathan Ariel wrote:
>   
>> BTW why making them equal will lower the frequency of GC?
>>
>> On 9/25/09, Fuad Efendi  wrote:
>>
>>>> Bigger heaps lead to bigger GC pauses in general.
>>>>
>>> Opposite viewpoint:
>>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.
>>>
>>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>>>
>>> Use -server option.
>>>
>>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>>> with SUN JVM 1.3 not showing any GC (just horizontal line).
>>>
>>> -Fuad
>>> http://www.linkedin.com/in/liferay
>>>
>>>
>>>
>>>
>>> 
>>>   
>
>
>   


-- 
- Mark

http://www.lucidimagination.com
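The equal `-Xms`/`-Xmx` plus `-server` advice quoted in this thread amounts to a launch line like the following. This is a sketch, not any poster's actual configuration: the 4096m figure is the example value from the thread, and `start.jar` assumes the Jetty example distribution:

```shell
# Pin the heap (no resizing) and use the server compiler, per the thread.
HEAP_OPTS="-server -Xms4096m -Xmx4096m"
echo java $HEAP_OPTS -jar start.jar   # Jetty-style Solr launch (illustrative)
```

As Mark notes, pinning the heap avoids resize work but does not by itself change GC pause behavior.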





Re: Faceted Search on Dynamic Fields?

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 12:19 PM, Avlesh Singh  wrote:
> Faceting, as of now, can only be done of definitive field names.

To further clarify, the fields you can facet on can include those
defined by dynamic fields.  You just must specify the exact field name
when you facet.

   

Did you really mean for the ampersand to be in the dynamic field name?
 I'd advise against this, and it could be the source of your problems
(escaping the ampersand in your request, etc).

What is the exact facet request you are sending?


-Yonik
http://www.lucidimagination.com
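For illustration, a facet request of the kind described above might look like the following. The host, port, and the `genre_facet` field name are hypothetical (not from the original schema); note that an `&` inside a field name would itself need URL-escaping (`%26`), which is one reason to avoid it:

```shell
# Facet on an exact field name that happens to match a dynamicField pattern.
# All names here are illustrative, not from the original schema.
URL="http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=genre_facet"
echo "$URL"
```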


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
It won't really - it will just keep the JVM from wasting time resizing
the heap on you. Since you know you need so much RAM anyway, no reason
not to just pin it at what you need.
Not going to help you much with GC though.

Jonathan Ariel wrote:
> BTW why making them equal will lower the frequency of GC?
>
> On 9/25/09, Fuad Efendi  wrote:
>   
>>> Bigger heaps lead to bigger GC pauses in general.
>>>   
>> Opposite viewpoint:
>> 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.
>>
>> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>>
>> Use -server option.
>>
>> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
>> with SUN JVM 1.3 not showing any GC (just horizontal line).
>>
>> -Fuad
>> http://www.linkedin.com/in/liferay
>>
>>
>>
>>
>> 


-- 
- Mark

http://www.lucidimagination.com





Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
BTW, why would making them equal lower the frequency of GC?

On 9/25/09, Fuad Efendi  wrote:
>> Bigger heaps lead to bigger GC pauses in general.
>
> Opposite viewpoint:
> 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second.
>
> To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)
>
> Use -server option.
>
> -server option of JVM is 'native CPU code', I remember WebLogic 7 console
> with SUN JVM 1.3 not showing any GC (just horizontal line).
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
>
>


Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
I can't really understand how increasing the heap will decrease the
11% of application time dedicated to GC.

On 9/25/09, Fuad Efendi  wrote:
>> You are saying that I should give more memory than 12GB?
>
>
> Yes. Look at this:
>
>> > SEVERE: java.lang.OutOfMemoryError: Java heap space
>>
> org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:3
>> 61
>> > )
>
>
>
> It can't find few (!!!) contiguous bytes for .createValue(...)
>
> It can't add (Field Value, Document ID) pair to an array.
>
> GC tuning won't help in this specific case...
>
> May be SOLR/Lucene core developers may WARM FieldCache at IndexReader
> opening time, in the future... to have early OOM...
>
>
> Avoiding faceting (and sorting) on such field will only postpone OOM to
> unpredictable date/time...
>
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
>
>


RE: Solr and Garbage Collection

2009-09-25 Thread cbennett
I would look at the JVM. Have you tried switching to the concurrent low
pause collector?

Colin.


-Original Message-
From: Jonathan Ariel [mailto:ionat...@gmail.com] 
Sent: Friday, September 25, 2009 12:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr and Garbage Collection

You are saying that I should give more memory than 12GB?
When I was with 10GB I had the exceptions that I sent. Switching to 12GB
made them disappear.
So I think I don't have problems with FieldCache any more. What it seems
like a problem is 11% on the application time dedicated to GC. Specially
when those servers are under really heavy load.
I think that's why I sometimes get queries that in one moment are being
executed in a few ms and a moment after 20 seconds!

It seems like I should tune my jvm, don't you think so?

On Fri, Sep 25, 2009 at 1:01 PM, Fuad Efendi  wrote:

> Give it even more memory.
>
> Lucene FieldCache is used to store non-tokenized single-value non-boolean
> (DocumentId -> FieldValue) pairs, and it is used (in-full!) for instance
> for
> sorting query results.
>
> So that if you have 100,000,000 documents with specific heavily
distributed
> field values (cardinality is high! Size is 100bytes!) you need
> 10,000,000,000 bytes for just this instance of FieldCache.
>
> GC does not play any role. FieldCache won't be GC-collected.
>
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
>
> > -Original Message-
> > From: Jonathan Ariel [mailto:ionat...@gmail.com]
> > Sent: September-25-09 11:37 AM
> > To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: Solr and Garbage Collection
> >
> > Right, now I'm giving it 12GB of heap memory.
> > If I give it less (10GB) it throws the following exception:
> >
> > Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> > at
> >
>
>
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:3
> 61
> > )
> > at
> >
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
> > at
> >
>
>
org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:3
> 52
> > )
> > at
> >
>
>
org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:2
> 67
> > )
> > at
> >
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
> > at
> >
>
>
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:2
> 07
> > )
> > at
> >
> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
> > at
> >
>
>
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java
> :7
> > 0)
> > at
> >
>
>
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHand
> le
> > r.java:169)
> > at
> >
>
>
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
> ja
> > va:131)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> > at
> >
>
>
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
> 03
> > )
> > at
> >
>
>
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
> 23
> > 2)
> > at
> >
>
>
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
> .j
> > ava:1089)
> > at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> > at
> >
>
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> > at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> > at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> > at
> >
>
>
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerColl
> ec
> > tion.java:211)
> > at
> >
>
>
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:11
> 4)
> > at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> > at org.mortbay.jetty.Server.handle(Server.java:285)
> > at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> > at
> >
>
>
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:
> 83
> > 5)
> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> > at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
> > at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> > at
> >
>
>
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:22
> 6)
> > at
> >
>
>
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:4
> 42
> > )
> >
> > On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
> > wrote:
> >
> > > On Fri, Sep 25, 2009 at 9:30 AM, Jonathan

Re: Faceted Search on Dynamic Fields?

2009-09-25 Thread Avlesh Singh
Faceting, as of now, can only be done on definite field names. Faceting on
field names matching wildcards (a dynamic field being one such scenario) is
yet to be supported. There are a lot of open issues aiming to achieve this.
Find a similar discussion here -
http://www.lucidimagination.com/search/document/787cc8cc9ea095e6/item_facet

Cheers
Avlesh

On Fri, Sep 25, 2009 at 7:47 PM, danben  wrote:

>
> Also, here is the field definition in the schema
>
> indexed="true" stored="true" multiValued="true"/>
>
>
> --
> View this message in context:
> http://www.nabble.com/Faceted-Search-on-Dynamic-Fields--tp25612887p25612936.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
Yes - more RAM is not a solution to your problem.

Jonathan Ariel wrote:
> You are saying that I should give more memory than 12GB?
> When I was with 10GB I had the exceptions that I sent. Switching to 12GB
> made them disappear.
> So I think I don't have problems with FieldCache any more. What it seems
> like a problem is 11% on the application time dedicated to GC. Specially
> when those servers are under really heavy load.
> I think that's why I sometimes get queries that in one moment are being
> executed in a few ms and a moment after 20 seconds!
>
> It seems like I should tune my jvm, don't you think so?
>
> On Fri, Sep 25, 2009 at 1:01 PM, Fuad Efendi  wrote:
>
>   
>> Give it even more memory.
>>
>> Lucene FieldCache is used to store non-tokenized single-value non-boolean
>> (DocumentId -> FieldValue) pairs, and it is used (in-full!) for instance
>> for
>> sorting query results.
>>
>> So that if you have 100,000,000 documents with specific heavily distributed
>> field values (cardinality is high! Size is 100bytes!) you need
>> 10,000,000,000 bytes for just this instance of FieldCache.
>>
>> GC does not play any role. FieldCache won't be GC-collected.
>>
>>
>> -Fuad
>> http://www.linkedin.com/in/liferay
>>
>>
>>
>> 
>>> -Original Message-
>>> From: Jonathan Ariel [mailto:ionat...@gmail.com]
>>> Sent: September-25-09 11:37 AM
>>> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
>>> Subject: Re: Solr and Garbage Collection
>>>
>>> Right, now I'm giving it 12GB of heap memory.
>>> If I give it less (10GB) it throws the following exception:
>>>
>>> Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>> at
>>>
>>>   
>> org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:3
>> 61
>> 
>>> )
>>> at
>>> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
>>> at
>>>
>>>   
>> org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:3
>> 52
>> 
>>> )
>>> at
>>>
>>>   
>> org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:2
>> 67
>> 
>>> )
>>> at
>>> org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
>>> at
>>>
>>>   
>> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:2
>> 07
>> 
>>> )
>>> at
>>>
>>>   
>> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
>> 
>>> at
>>>
>>>   
>> org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java
>> :7
>> 
>>> 0)
>>> at
>>>
>>>   
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHand
>> le
>> 
>>> r.java:169)
>>> at
>>>
>>>   
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
>> ja
>> 
>>> va:131)
>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>>> at
>>>
>>>   
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
>> 03
>> 
>>> )
>>> at
>>>
>>>   
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
>> 23
>> 
>>> 2)
>>> at
>>>
>>>   
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
>> .j
>> 
>>> ava:1089)
>>> at
>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>> at
>>>
>>>   
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>> 
>>> at
>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>> at
>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>> at
>>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>> at
>>>
>>>   
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerColl
>> ec
>> 
>>> tion.java:211)
>>> at
>>>
>>>   
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:11
>> 4)
>> 
>>> at
>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>> at org.mortbay.jetty.Server.handle(Server.java:285)
>>> at
>>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>> at
>>>
>>>   
>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:
>> 83
>> 
>>> 5)
>>> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>>> at
>>>   
>> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>> 
>>> at
>>>   
>> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>> 
>>> at
>>>
>>>   
>> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:22
>> 6)
>> 
>>> at
>>>
>>>   
>> org.mortbay.th

RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
> You are saying that I should give more memory than 12GB?


Yes. Look at this:

> > SEVERE: java.lang.OutOfMemoryError: Java heap space
>
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:3
> 61
> > )



It can't find a few (!!!) contiguous bytes for .createValue(...)

It can't add a (Field Value, Document ID) pair to an array.

GC tuning won't help in this specific case...

Maybe SOLR/Lucene core developers could WARM the FieldCache at IndexReader
opening time, in the future... to get an early OOM...


Avoiding faceting (and sorting) on such field will only postpone OOM to
unpredictable date/time...


-Fuad
http://www.linkedin.com/in/liferay





Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Are you saying that I should give it more memory than 12GB?
When I was at 10GB I got the exceptions that I sent. Switching to 12GB
made them disappear.
So I think I don't have problems with FieldCache any more. What does seem
like a problem is the 11% of application time dedicated to GC, especially
when those servers are under really heavy load.
I think that's why I sometimes get queries that execute in a few ms one
moment and take 20 seconds the next!

It seems like I should tune my JVM, don't you think?

On Fri, Sep 25, 2009 at 1:01 PM, Fuad Efendi  wrote:

> Give it even more memory.
>
> Lucene FieldCache is used to store non-tokenized single-value non-boolean
> (DocumentId -> FieldValue) pairs, and it is used (in-full!) for instance
> for
> sorting query results.
>
> So that if you have 100,000,000 documents with specific heavily distributed
> field values (cardinality is high! Size is 100bytes!) you need
> 10,000,000,000 bytes for just this instance of FieldCache.
>
> GC does not play any role. FieldCache won't be GC-collected.
>
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
>
>
> > -Original Message-
> > From: Jonathan Ariel [mailto:ionat...@gmail.com]
> > Sent: September-25-09 11:37 AM
> > To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: Solr and Garbage Collection
> >
> > Right, now I'm giving it 12GB of heap memory.
> > If I give it less (10GB) it throws the following exception:
> >
> > Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> > at
> >
>
> org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:3
> 61
> > )
> > at
> > org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
> > at
> >
>
> org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:3
> 52
> > )
> > at
> >
>
> org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:2
> 67
> > )
> > at
> > org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
> > at
> >
>
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:2
> 07
> > )
> > at
> >
> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
> > at
> >
>
> org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java
> :7
> > 0)
> > at
> >
>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHand
> le
> > r.java:169)
> > at
> >
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
> ja
> > va:131)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> > at
> >
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
> 03
> > )
> > at
> >
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
> 23
> > 2)
> > at
> >
>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
> .j
> > ava:1089)
> > at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> > at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> > at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> > at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> > at
> >
>
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerColl
> ec
> > tion.java:211)
> > at
> >
>
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:11
> 4)
> > at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> > at org.mortbay.jetty.Server.handle(Server.java:285)
> > at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> > at
> >
>
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:
> 83
> > 5)
> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> > at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
> > at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> > at
> >
>
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:22
> 6)
> > at
> >
>
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:4
> 42
> > )
> >
> > On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
> > wrote:
> >
> > > On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel 
> > > wrote:
> > > > Hi to all!
> > > > Lately my solr servers seem to stop responding once in a while. I'm
> using
> > > > solr 1.3.
> > > > Of course I'm having more traffic on the servers.
> > > > So I logged the Garbage Collection activity to check if it's because
> of
> > > > tha

RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
Give it even more memory.

Lucene FieldCache is used to store non-tokenized single-value non-boolean
(DocumentId -> FieldValue) pairs, and it is used (in-full!) for instance for
sorting query results.

So if you have 100,000,000 documents with heavily distributed
field values (cardinality is high! size is 100 bytes!) you need
10,000,000,000 bytes for just this one instance of the FieldCache.

GC does not play any role. FieldCache won't be GC-collected.


-Fuad
http://www.linkedin.com/in/liferay
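The 10,000,000,000-byte figure above is straight multiplication; a quick sanity check with the numbers from the email:

```shell
# FieldCache back-of-envelope from the email: docs x avg bytes per value.
docs=100000000          # 100,000,000 documents
bytes_per_value=100     # ~100-byte field values, high cardinality
total=$((docs * bytes_per_value))
echo "$total bytes"     # ~10 GB for this one cache entry alone
```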



> -Original Message-
> From: Jonathan Ariel [mailto:ionat...@gmail.com]
> Sent: September-25-09 11:37 AM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: Re: Solr and Garbage Collection
> 
> Right, now I'm giving it 12GB of heap memory.
> If I give it less (10GB) it throws the following exception:
> 
> Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.OutOfMemoryError: Java heap space
> at
>
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:3
61
> )
> at
> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
> at
>
org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:3
52
> )
> at
>
org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:2
67
> )
> at
> org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
> at
>
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:2
07
> )
> at
> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
> at
>
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java
:7
> 0)
> at
>
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHand
le
> r.java:169)
> at
>
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
ja
> va:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> at
>
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
03
> )
> at
>
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
23
> 2)
> at
>
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
.j
> ava:1089)
> at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> at
>
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> at
>
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerColl
ec
> tion.java:211)
> at
>
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:11
4)
> at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> at org.mortbay.jetty.Server.handle(Server.java:285)
> at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> at
>
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:
83
> 5)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
> at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> at
>
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:22
6)
> at
>
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:4
42
> )
> 
> On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
> wrote:
> 
> > On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel 
> > wrote:
> > > Hi to all!
> > > Lately my solr servers seem to stop responding once in a while. I'm
> > > using solr 1.3.
> > > Of course I'm having more traffic on the servers.
> > > So I logged the Garbage Collection activity to check if it's because
> > > of that. It seems like 11% of the time the application runs, it is
> > > stopped because of GC. And sometimes the GC takes up to 10 seconds!
> > > Is it normal? My instances run on a 16GB RAM, Dual Quad Core Intel
> > > Xeon servers. My index is around 10GB and I'm giving the instances 10GB
> > > of RAM.
> >
> > Bigger heaps lead to bigger GC pauses in general.
> > Do you mean that you are giving the JVM a 10GB heap?  Were you getting
> > OOM exceptions with a smaller heap?
> >
> > -Yonik
> > http://www.lucidimagination.com
> >




RE: Solr and Garbage Collection

2009-09-25 Thread Fuad Efendi
> Bigger heaps lead to bigger GC pauses in general.

Opposite viewpoint:
1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second. 

To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!)

Use -server option.

The -server option of the JVM is 'native CPU code'; I remember the WebLogic 7
console with SUN JVM 1.3 not showing any GC (just a horizontal line). 

-Fuad
http://www.linkedin.com/in/liferay





FACET_SORT_INDEX descending?

2009-09-25 Thread Gerald Snyder
Is there any value for the "f.my_year_facet.facet.sort"  parameter that 
will return the facet values in descending order?   So far I only see 
"index" and "count" as the choices. 


http://lucene.apache.org/solr/api/org/apache/solr/common/params/FacetParams.html#FACET_SORT_INDEX

Thanks.
Gerald Snyder 
Florida Center for Library Automation





Re: Solr and Garbage Collection

2009-09-25 Thread Mark Miller
I've got the start of a Garbage Collection article here:
http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

I plan to tie it more into Lucene/Solr and add some more about the
theory/methods in the final version.

With so much RAM, I take it you prob have a handful of processors as well?

You might start by trying the Concurrent Low Pause Collector if you have
not. You might also pair it with the parallel new generation collector.
If you still get long pauses, you might try lowering
-XX:CMSInitiatingOccupancyFraction, to kick off major collections earlier.

It can still be difficult with really large fieldcaches, because all of
sudden, everything is released at once when the Reader goes away - but
there should be some combo of settings that at least help alleviate the
issue, especially by dedicating another processor to the task that can
work somewhat in parallel without stopping your application threads for
so long.

If you have some success tuning, report back with your results if you could.

-- 
- Mark

http://www.lucidimagination.com
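The collector combination described above can be expressed as JVM flags along these lines. This is a hedged sketch: the occupancy fraction of 60 and the 10g heap are illustrative starting points, not recommendations from the thread:

```shell
# Concurrent low pause (CMS) collector plus parallel new-generation
# collection, with major collections kicked off earlier than the default.
CMS_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
CMS_OPTS="$CMS_OPTS -XX:CMSInitiatingOccupancyFraction=60"
echo java -server $CMS_OPTS -Xms10g -Xmx10g -jar start.jar   # illustrative
```

Lowering `CMSInitiatingOccupancyFraction` trades more frequent concurrent cycles for a smaller chance of a long stop-the-world full collection.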



Jonathan Ariel wrote:
> Right, now I'm giving it 12GB of heap memory.
> If I give it less (10GB) it throws the following exception:
>
> [OutOfMemoryError stack trace snipped; it appears in full in Jonathan's original message below]
>
> On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
> wrote:
>
>   
>> On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel 
>> wrote:
>> 
>>> Hi to all!
>>> Lately my solr servers seem to stop responding once in a while. I'm using
>>> solr 1.3.
>>> Of course I'm having more traffic on the servers.
>>> So I logged the Garbage Collection activity to check if it's because of
>>> that. It seems like 11% of the time the application runs, it is stopped
>>> because of GC. And sometimes the GC takes up to 10 seconds!
>>> Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
>>> servers. My index is around 10GB and I'm giving the instances 10GB of
>>> RAM.
>>>   
>> Bigger heaps lead to bigger GC pauses in general.
>> Do you mean that you are giving the JVM a 10GB heap?  Were you getting
>> OOM exceptions with a smaller heap?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> 
>
>   



RE: Solr and Garbage Collection

2009-09-25 Thread cbennett
Hi,

Have you looked at tuning the garbage collection ?

Take a look at the following articles

http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot
-camp-draft/
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

Changing to the concurrent or throughput collector should help with the long
pauses.
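As a sketch of Colin's suggestion (flag spellings are for the Sun Java 5/6 HotSpot VM; the heap size here is just a placeholder, not a recommendation), the collector can be selected and its pauses logged via startup options like:

```shell
# Hypothetical startup options -- adjust heap size to your own measurements.
GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"   # concurrent low-pause collector
# Or, for the throughput collector instead:
#   GC_OPTS="-XX:+UseParallelGC -XX:+UseParallelOldGC"
LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log"
JAVA_OPTS="-Xms10g -Xmx10g $GC_OPTS $LOG_OPTS"
echo "$JAVA_OPTS"
```

The resulting gc.log then shows directly whether the long stop-the-world pauses actually shrink.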


Colin.

-Original Message-
From: Jonathan Ariel [mailto:ionat...@gmail.com] 
Sent: Friday, September 25, 2009 11:37 AM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: Re: Solr and Garbage Collection

Right, now I'm giving it 12GB of heap memory.
If I give it less (10GB) it throws the following exception:

[OutOfMemoryError stack trace snipped; it appears in full in Jonathan's original message below]

On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
wrote:

> On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel 
> wrote:
> > Hi to all!
> > Lately my solr servers seem to stop responding once in a while. I'm
using
> > solr 1.3.
> > Of course I'm having more traffic on the servers.
> > So I logged the Garbage Collection activity to check if it's because of
> > that. It seems like 11% of the time the application runs, it is stopped
> > because of GC. And sometimes the GC takes up to 10 seconds!
> > Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
> > servers. My index is around 10GB and I'm giving the instances 10GB of
> > RAM.
>
> Bigger heaps lead to bigger GC pauses in general.
> Do you mean that you are giving the JVM a 10GB heap?  Were you getting
> OOM exceptions with a smaller heap?
>
> -Yonik
> http://www.lucidimagination.com
>





Re: Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Right, now I'm giving it 12GB of heap memory.
If I give it less (10GB) it throws the following exception:

Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
at
org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)
at
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
at
org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352)
at
org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:267)
at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:207)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:70)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley
wrote:

> On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel 
> wrote:
> > Hi to all!
> > Lately my solr servers seem to stop responding once in a while. I'm using
> > solr 1.3.
> > Of course I'm having more traffic on the servers.
> > So I logged the Garbage Collection activity to check if it's because of
> > that. It seems like 11% of the time the application runs, it is stopped
> > because of GC. And sometimes the GC takes up to 10 seconds!
> > Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
> > servers. My index is around 10GB and I'm giving the instances 10GB of
> > RAM.
>
> Bigger heaps lead to bigger GC pauses in general.
> Do you mean that you are giving the JVM a 10GB heap?  Were you getting
> OOM exceptions with a smaller heap?
>
> -Yonik
> http://www.lucidimagination.com
>


Re: What options would you recommend for the Sun JVM?

2009-09-25 Thread Grant Ingersoll


On Sep 25, 2009, at 7:30 AM, Jérôme Etévé wrote:


Hi solr addicts,

I know there's no one size fits all set of options for the sun JVM,
but I think It'd be useful to everyone to share your tips on using the
sun JVM with solr.

For instance, I recently figured out that setting the tenured-generation
garbage collector to concurrent mark-and-sweep
(-XX:+UseConcMarkSweepGC) has dramatically decreased the amount of
time Java hangs on tenured-gen garbage collection. With my settings,
the old-gen collection went from big chunks of 1-2 seconds to multiple
small slices of ~0.2 s.

As a result, the commits (hence the searcher drop/rebuild) are much
less painful from the application performance point of view.

What are the other options you would recommend?



Mark M. just posted some on this: 
http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

I'm usually wary of futzing too much w/ the parameters.  With a proper  
machine (multiple cores), the concurrent low pause collector is the  
way to go and I usually leave it at that.  Beyond that, I don't  
usually recommend getting involved with too many parameters.  GC  
settings are often a black art and I've seen many cases of people  
going down the rat hole of trying to figure out how to get their GC  
right, when they could have spent far less time thinking about how  
they model their domain in Lucene/Solr to produce less garbage in the  
first place without affecting functionality one bit.  After all, most  
shops have domain expertise and not GC expertise.


The other thing to be wary of is too big of a heap.  Basically, take a  
pragmatic approach and test under load using real queries/indexing and  
see what the heap high water mark is (which is almost always evident  
during and right after a commit) and then set it at that plus maybe 1  
more GB just to be on the safe side.
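A minimal sketch of that sizing rule (the jstat command and the 9GB peak are hypothetical; measure your own high-water mark under real load):

```shell
# Watch heap occupancy under load, e.g. with:  jstat -gcutil <pid> 5000
PEAK_GB=9                  # assumed measured high-water mark, in GB
XMX_GB=$((PEAK_GB + 1))    # peak plus ~1GB of headroom, per the advice above
echo "-Xmx${XMX_GB}g"
```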


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: Faceted Search on Dynamic Fields?

2009-09-25 Thread danben

Also, here is the field definition in the schema




-- 
View this message in context: 
http://www.nabble.com/Faceted-Search-on-Dynamic-Fields--tp25612887p25612936.html
Sent from the Solr - User mailing list archive at Nabble.com.



Faceted Search on Dynamic Fields?

2009-09-25 Thread danben

I'm trying to perform a faceted query with the facet field referencing a
field that is not in the schema but matches a dynamicField with its suffix. 
The query returns results but for some reason the facet list is always
empty.  When I change the facet field to one that is explicitly named in the
schema I get the proper results.  Is this expected behavior?  I wasn't able
to find anything in the docs about dynamic fields wrt faceting.

One other thing I thought might have been causing the problem is that the
values in this field are mostly distinct (that won't be the case in the
actual application, I'm just doing it this way now to see how faceted
queries behave).  However, when I performed the same query with a static
field with lots of distinct values I just got an OutOfMemoryError, which
leads me back to my original hypothesis.

So, is it the case that faceted queries are not permitted on dynamic facet
fields, and if so, is there any workaround?
-- 
View this message in context: 
http://www.nabble.com/Faceted-Search-on-Dynamic-Fields--tp25612887p25612887.html
Sent from the Solr - User mailing list archive at Nabble.com.
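For reference, a facet request against a field matched by a dynamicField pattern looks no different from one against a static field; this sketch assumes a hypothetical tag_s field matching a *_s dynamicField and a local Solr instance:

```shell
FIELD="tag_s"   # hypothetical field name matching a *_s dynamicField
URL="http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=${FIELD}&facet.mincount=1&facet.limit=10"
echo "$URL"
# curl "$URL"
```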



RE: Mixed field types and boolean searching

2009-09-25 Thread Ensdorf Ken
> No- there are various analyzers. StandardAnalyzer is geared toward
> searching bodies of text for interesting words -  punctuation is
> ripped out. Other analyzers are more useful for "concrete" text. You
> may have to work at finding one that leaves punctuation in.
> 

My problem is not with the StandardAnalyzer per se, but more as to how "dismax" 
style queries are handled by the query parser when the different fields have 
different sets of ignored tokens or stop words.

Say you want to use the contents of a text box in your app and query a field in 
Solr.  The user enters "A and B", so you map this to "f1:A and f1:B".  Now, if 
"B" is an ignored token in the "f1" field for whatever reason, the query boils 
down to "f1:A".  

Now imagine you want to allow the user's text to match multiple fields - as in 
any term can match any field, but all terms must match at least 1 field.  So 
now you map the user's query to "(f1:A OR f2:A) AND (f1:B OR f2:B)".  But if f2 
does not ignore "B", the query boils down to "(f1:A OR f2:A) AND (f2:B)".  Now 
documents that could come back when you were only matching against the f1 field 
don't come back.  

This seems counter-intuitive - to be consistent, I would think the query should 
essentially be treated as "(f1:A OR f2:A) AND (TRUE OR f2:B) " - and thus a 
term that is a stop word or ignored token for any of the fields would be 
ignored across the board.
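The collapse Ken describes can be reproduced with a toy sketch (plain shell, not the actual query-parser logic; the field names and stop list are invented):

```shell
f1_stop="b"        # pretend field f1 treats "b" as a stopword; f2 does not
query=""
for term in a b; do
  clause="(f1:$term OR f2:$term)"
  # per-field analysis silently drops the f1 clause for its stopword...
  case " $f1_stop " in *" $term "*) clause="(f2:$term)" ;; esac
  query="$query AND $clause"
done
RESULT="${query# AND }"
echo "$RESULT"     # -> (f1:a OR f2:a) AND (f2:b)
```

...so a document that previously matched on f1 alone is now excluded, exactly as described.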

So I guess what I'm asking is if there is a reason for the existing behavior, 
or is it just a fact-of-life of the query parser?  Thanks!

-Ken


Re: Parallel requests to Tomcat

2009-09-25 Thread Michael
Thank you Grant and Lance for your comments -- I've run into a separate snag
which puts this on hold for a bit, but I'll return to finish digging into
this and post my results. - Michael
On Thu, Sep 24, 2009 at 9:23 PM, Lance Norskog  wrote:

> Are you on Java 5, 6 or 7? Each release sees some tweaking of the Java
> multithreading model as well as performance improvements (and bug
> fixes) in the Sun HotSpot runtime.
>
> You may be tripping over the TCP/IP multithreaded connection manager.
> You might wish to create each client thread with a separate socket.
>
> Also, here is a standard bit of benchmarking advice: include "think
> time". This means that instead of sending requests constantly, each
> thread should time out for a few seconds before sending the next
> request. This simulates a user "stopping and thinking" before clicking
> the mouse again. This helps simulate the quantity of threads, etc.
> which are stopped and waiting at each stage of the request pipeline.
> As it is, you are trying to simulate the throughput behaviour without
> simulating the horizontal volume. (Benchmarking is much harder than it
> looks.)
>
> On Wed, Sep 23, 2009 at 9:43 AM, Grant Ingersoll 
> wrote:
> >
> > On Sep 23, 2009, at 12:09 PM, Michael wrote:
> >
> >> On Wed, Sep 23, 2009 at 12:05 PM, Yonik Seeley
> >> wrote:
> >>
> >>> On Wed, Sep 23, 2009 at 11:47 AM, Michael  wrote:
> 
>  If this were IO bound, wouldn't I see the same results when sending my
> 8
>  requests to 8 Tomcats?  There's only one "disk" (well, RAM) whether
> I'm
>  querying 8 processes or 8 threads in 1 process, right?
> >>>
> >>> Right - I was thinking IO bound at the Lucene Directory level - which
> >>> synchronized in the past and led to poor concurrency.  Buy your Solr
> >>> version is recent enough to use the newer unsynchronized method by
> >>> default (on non-windows)
> >>>
> >>
> >> Ah, OK.  So it looks like comparing to Jetty is my only next step.
> >>  Although
> >> I'm not sure what I'm going to do based on the result of that test -- if
> >> Jetty behaves differently, then I still don't know why the heck Tomcat
> is
> >> behaving badly! :)
> >
> >
> > Have you done any profiling to see where hotspots are?  Have you looked
> at
> > garbage collection?  Do you have any full collections occurring?  What
> > garbage collector are you using?  How often are you updating/committing,
> > etc?
> >
> >
> > --
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-25 Thread Michael
Are you storing (in addition to indexing) your data?  Perhaps you could turn
off storage on data older than 7 days (requires reindexing), thus losing the
ability to return snippets but cutting down on your storage space and server
count.  I've experienced 10x decrease in space requirements and a large
boost in speed after cutting extraneous storage from Solr -- the stored data
is mixed in with the index data and so it slows down searches.
You could also put all 200G onto one Solr instance rather than 10 for >7days
data, and accept that those searches will be slower.

Michael

On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer wrote:

> Hi,
>
> Thank you Michael and Chris for the response.
>
> Today after the mail from Michael, we tested with the dynamic loading of
> cores and it worked well. So we need to go with the hybrid approach of
> Multicore and Distributed searching.
>
> As per our testing, we found that a Solr instance with 20 GB of
> index(single index or spread across multiple cores) can provide better
> performance when compared to having a Solr instance say 40 (or) 50 GB of
> index (single index or index spread across cores).
>
> So the 200 GB of index on day 1 will be spread across 200/20=10 Solr salve
> instances.
>
> On day 2 data, 10 more Solr slave servers are required; Cumulative Solr
> Slave instances = 200*2/20=20
> ...
> ..
> ..
> On day 30 data, 10 more Solr slave servers are required; Cumulative Solr
> Slave instances = 200*30/20=300
>
> So with the above approach, we may need ~300 Solr slave instances, which
> becomes very unmanageable.
>
> But we know that most of the queries is for the past 1 week, i.e we
> definitely need 70 Solr Slaves containing the last 7 days worth of data up
> and running.
>
> Now for the rest of the 230 Solr instances, do we need to keep it running
> for the odd query,that can span across the 30 days of data (30*200 GB=6 TB
> data) which can come up only a couple of times a day.
> This linear increase of Solr servers with the retention period doesn't
> seems to be a very scalable solution.
>
> So we are looking for something more simpler approach to handle this
> scenario.
>
> Appreciate any further inputs/suggestions.
>
> Regards,
> sS
>
> --- On Fri, 9/25/09, Chris Hostetter  wrote:
>
> > From: Chris Hostetter 
> > Subject: Re: Can we point a Solr server to index directory dynamically
> at  runtime..
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 25, 2009, 4:04 AM
> > : Using a multicore approach, you
> > could send a "create a core named
> > : 'core3weeksold' pointing to '/datadirs/3weeksold' "
> > command to a live Solr,
> > : which would spin it up on the fly.  Then you query
> > it, and maybe keep it
> > : spun up until it's not queried for 60 seconds or
> > something, then send a
> > : "remove core 'core3weeksold' " command.
> > : See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler
> > .
> >
> > something that seems implicit in the question is what to do
> > when the
> > request spans all of the data ... this is where (in theory)
> > distributed
> > searching could help you out.
> >
> > index each days worth of data into it's own core, that
> > makes it really
> > easy to expire the old data (just UNLOAD and delete an
> > entire core once
> it's more than 30 days old) if your user is only searching
> "current" data
> > then your app can directly query the core containing the
> > most current data
> > -- but if they want to query the last week, or last two
> > weeks worth of
> > data, you do a distributed request for all of the shards
> > needed to search
> > the appropriate amount of data.
> >
> > Between the ALIAS and SWAP commands it on the CoreAdmin
> > screen it should
> > be pretty easy have cores with names like
> > "today","1dayold","2dayold" so
> > that your app can configure simple shard params for all the
> > perumations
> > you'll need to query.
> >
> >
> > -Hoss
> >
> >
>
>
>
>
>
>
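Hoss's per-day-core scheme, sketched as CoreAdmin calls (the host, core name, and data directory are hypothetical; CREATE and UNLOAD are the documented CoreAdmin actions):

```shell
SOLR="http://localhost:8983/solr"    # hypothetical host
CREATE_URL="$SOLR/admin/cores?action=CREATE&name=core3weeksold&instanceDir=/datadirs/3weeksold"
UNLOAD_URL="$SOLR/admin/cores?action=UNLOAD&core=core3weeksold"
echo "$CREATE_URL"    # spin the old core up on demand, e.g.: curl "$CREATE_URL"
echo "$UNLOAD_URL"    # and drop it once it has been idle for a while
```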


Re: OOM error during merge - index still ok?

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 8:20 AM, Phillip Farber  wrote:
>  Can I expect the index to be left in a usable state after an out-of-memory
> error during a merge, or is it most likely to be corrupt?

It should be in the state it was after the last successful commit.

-Yonik
http://www.lucidimagination.com

>  I'd really hate to
> have to start this index build again from square one.  Thanks.
>
> Thanks,
>
> Phil
>
> ---
> Exception in thread "http-8080-Processor2505" java.lang.OutOfMemoryError:
> Java heap space
> Exception in thread "RMI TCP Connection(131)-141.213.128.155"
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]"
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "http-8080-Processor2537" java.lang.OutOfMemoryError:
> Java heap space
> Exception in thread "http-8080-Processor2483" Exception in thread "RMI
> Scheduler(0)" java.lang.OutOfMemoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "Lucene Merge Thread #202"
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.OutOfMemoryError: Java heap space
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> Exception in thread "Lucene Merge Thread #266"
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot
> merge
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
> Caused by: java.lang.IllegalStateException: this writer hit an
> OutOfMemoryError; cannot merge
>   at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4529)
>   at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4512)
>   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4424)
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
> WARN: The method class
> org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
> WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
> WARN: The method class
> org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
> WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
>
>
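One way to confirm Yonik's point on a given index is Lucene's CheckIndex tool (the jar name and paths below are hypothetical; run it against a copy of the index, and note that -fix removes corrupt segments):

```shell
# Read-only integrity check of the index directory:
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index
# Only after backing up (it drops unreadable segments):
#   java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index -fix
```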


RE: Alphanumeric Wild Card Search Question

2009-09-25 Thread Carr, Adrian
In case it helps, here's what I have currently, but I've been messing with 
different options:


 

-Original Message-
From: Carr, Adrian [mailto:adrian.c...@jtv.com] 
Sent: Friday, September 25, 2009 9:28 AM
To: solr-user@lucene.apache.org
Subject: RE: Alphanumeric Wild Card Search Question

Hi Ken,
I am using the WordDelimiterFilterFactory. I thought I needed it because I 
thought that's what gave me the control over the options of how the words are 
split and indexed? I did try taking it out completely, but that didn't seem to 
help.

I'll try the analysis tool today. There has got to be a simple solution for 
this, but it is sure eluding me.
Thanks,
Adrian

-Original Message-
From: Ensdorf Ken [mailto:ensd...@zoominfo.com]
Sent: Thursday, September 24, 2009 5:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Alphanumeric Wild Card Search Question

> Here's my question:
> I have some products that I want to allow people to search for with 
> wild cards. For example, if my product is YBM354, I'd like for users 
> to be able to search on "YBM*", "YBM3*", "YBM35*" and for any of these 
> searches to return that product. I've found that I can search for 
> "YBM*" and get the product, just not the other combinations.

Are you using WordDelimiterFilterFactory?  That would explain this behavior.

If so, do you need it - for the queries you describe you don't need that kind 
of tokenization.

Also, have you played with the analysis tool on the admin page, it is a great 
help in debugging things like this.

-Ken
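A toy illustration of why only "YBM*" matches (plain shell, not the real filter; WordDelimiterFilterFactory splits on letter/digit transitions, so "YBM354" is indexed as separate tokens):

```shell
TERM="YBM354"
LETTERS=$(printf '%s' "$TERM" | sed 's/[0-9].*//')    # alpha part: YBM
DIGITS=$(printf '%s' "$TERM" | sed 's/[^0-9]*//')     # numeric part: 354
echo "indexed tokens: $LETTERS $DIGITS"
# "YBM*" matches the token "YBM"; "YBM3*" matches neither token, hence no hit.
```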


Re: Solr and Garbage Collection

2009-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel  wrote:
> Hi to all!
> Lately my solr servers seem to stop responding once in a while. I'm using
> solr 1.3.
> Of course I'm having more traffic on the servers.
> So I logged the Garbage Collection activity to check if it's because of
> that. It seems like 11% of the time the application runs, it is stopped
> because of GC. And sometimes the GC takes up to 10 seconds!
> Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
> servers. My index is around 10GB and I'm giving the instances 10GB of
> RAM.

Bigger heaps lead to bigger GC pauses in general.
Do you mean that you are giving the JVM a 10GB heap?  Were you getting
OOM exceptions with a smaller heap?

-Yonik
http://www.lucidimagination.com


Solr and Garbage Collection

2009-09-25 Thread Jonathan Ariel
Hi to all!
Lately my solr servers seem to stop responding once in a while. I'm using
solr 1.3.
Of course I'm having more traffic on the servers.
So I logged the Garbage Collection activity to check if it's because of
that. It seems like 11% of the time the application runs, it is stopped
because of GC. And sometimes the GC takes up to 10 seconds!
Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon
servers. My index is around 10GB and I'm giving the instances 10GB of
RAM.

How can I check which GC is being used? If I'm right, JVM
Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have
any recommendation on this?

Thanks,

Jonathan
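To the last question: the simplest check I know of (the flags below exist on Sun JVMs) is to ask the JVM to print its ergonomics-selected flags:

```shell
# Prints the collector the JVM picked, e.g. -XX:+UseParallelGC on a server-class machine:
java -XX:+PrintCommandLineFlags -version
# For a running instance (hypothetical pid), jmap also reports the collectors in use:
#   jmap -heap <pid>
```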


RE: Alphanumeric Wild Card Search Question

2009-09-25 Thread Carr, Adrian
Hi Ken,
I am using the WordDelimiterFilterFactory. I thought I needed it because I 
thought that's what gave me the control over the options of how the words are 
split and indexed? I did try taking it out completely, but that didn't seem to 
help.

I'll try the analysis tool today. There has got to be a simple solution for 
this, but it is sure eluding me.
Thanks,
Adrian

-Original Message-
From: Ensdorf Ken [mailto:ensd...@zoominfo.com] 
Sent: Thursday, September 24, 2009 5:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Alphanumeric Wild Card Search Question

> Here's my question:
> I have some products that I want to allow people to search for with 
> wild cards. For example, if my product is YBM354, I'd like for users 
> to be able to search on "YBM*", "YBM3*", "YBM35*" and for any of these 
> searches to return that product. I've found that I can search for 
> "YBM*" and get the product, just not the other combinations.

Are you using WordDelimiterFilterFactory?  That would explain this behavior.

If so, do you need it - for the queries you describe you don't need that kind 
of tokenization.

Also, have you played with the analysis tool on the admin page, it is a great 
help in debugging things like this.

-Ken


DIH & RSS > 1.4 nightly 2009-09-25 > full-import&clean=false always clean and import command do nothing

2009-09-25 Thread Brahim Abdesslam

Hello everybody,

we are using Solr to index some RSS feeds for a news agregator application.

We've got some difficulties with the publication date of each item
because each site uses a homemade date format.
The fact is that we want to know the exact amount of time elapsed
between the date of publication and now.


So we decided to use a timestamp that stores the index time for each item.

The problem is :

   * when i do a full-import&clean=false the index is always cleaned.
   * when i do a simple import, nothing seems to be done.

Here is the configuration :

   * Apache Solr 1.4 Nightly 2009-09-25
   * java version : build 1.6.0_15-b03
   * Java HotSpot Client VM : build 14.1-b02, mixed mode, sharing

=> data-config.xml (the XML markup was stripped by the list archive; the
surviving attributes of the RSS entity are)

   url="http://www.capital.fr/rss2/feed/fil-bourse.xml"
   processor="XPathEntityProcessor"
   forEach="/rss/channel | /rss/channel/item"
   transformer="DateFormatTransformer, TemplateTransformer"
   onError="continue"
   xpath="/rss/channel/item/description"
   dateTimeFormat="EEE, dd MMM  HH:mm:ss z"

=> schema.xml

[...]

[field definitions stripped by the list archive; surviving fragments include
default="NOW", multiValued="true", default="NOW" multiValued="false",
link, all_text]

[...]

- Tests :

=> command=full-import&clean=false

25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties

INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} 
status=0 QTime=6
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.DataImporter 
doFullImport

INFO: Starting Full Import
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties

INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=D:\srv\solr\index,segFN=segments_2s,version=1251453476028,generation=100,filenames=[segments_2s, _3u.cfs, _3u.cfx]
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1251453476028
25-Sep-2009 14:58:22 org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully

=> command=import

25-Sep-2009 14:59:20 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=import} status=0 
QTime=0
25-Sep-2009 14:59:20 org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties

INFO: Read dataimport.properties

Any idea or suggestion ?
Thank you in advance!
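One hypothesis worth checking: the first log line shows params={command=full-import} with no clean parameter at all, so clean=false may never have reached Solr -- an unquoted & in a shell cuts off the rest of the URL. Also, as far as I recall, DIH's import commands are full-import and delta-import; a bare command=import does nothing, which would explain the no-op. A sketch (local URL assumed):

```shell
# Quote the URL so the shell does not treat '&' as a background operator:
curl 'http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true'
# Incremental runs use delta-import, not "import":
curl 'http://localhost:8983/solr/dataimport?command=delta-import'
```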
--

Brahim Abdesslam
Directeur des opérations

* Maecia - /Développement web/ *
Mob : +33 (0)6 82 87 31 27
Tel  : +33 (0)9 54 99 29 59
Fax : +33 (0)9 59 99 29 59

http://www.maecia.com 



OOM error during merge - index still ok?

2009-09-25 Thread Phillip Farber
  
Can I expect the index to be left in a usable state after an out of 
memory error during a merge, or is it most likely to be corrupt? I'd 
really hate to have to start this index build again from square one.


Thanks,

Phil

---
Exception in thread "http-8080-Processor2505" 
java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI TCP Connection(131)-141.213.128.155" 
java.lang.OutOfMemoryError: Java heap space
Exception in thread 
"ContainerBackgroundProcessor[StandardEngine[Catalina]]" 
java.lang.OutOfMemoryError: Java heap space
Exception in thread "http-8080-Processor2537" 
java.lang.OutOfMemoryError: Java heap space
Exception in thread "http-8080-Processor2483" Exception in thread "RMI 
Scheduler(0)" java.lang.OutOfMemoryError: Java heap space

java.lang.OutOfMemoryError: Java heap space
Exception in thread "Lucene Merge Thread #202" 
org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.OutOfMemoryError: Java heap space
   at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
   at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)

Caused by: java.lang.OutOfMemoryError: Java heap space
Exception in thread "Lucene Merge Thread #266" 
org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.IllegalStateException: this writer hit an OutOfMemoryError; 
cannot merge
   at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
   at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
Caused by: java.lang.IllegalStateException: this writer hit an 
OutOfMemoryError; cannot merge

   at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4529)
   at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4512)
   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4424)
   at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
   at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
WARN: The method class 
org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.

WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
WARN: The method class 
org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.

WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
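For what it's worth: when IndexWriter hits an OutOfMemoryError it refuses to commit further changes (that is what the "this writer hit an OutOfMemoryError; cannot merge" traces above indicate), so the on-disk index normally stays at its last successful commit rather than being corrupted. One way to verify is Lucene's CheckIndex tool; the sketch below uses assumed paths and an assumed jar version.

```shell
# Sketch: verifying an index after an OOM. The index path and the
# lucene-core jar version are assumptions for illustration.
INDEX_DIR=/srv/solr/index
LUCENE_JAR=lucene-core-2.4.0.jar

# Read-only check (reports any corrupt segments):
#   java -cp "$LUCENE_JAR" org.apache.lucene.index.CheckIndex "$INDEX_DIR"
#
# CheckIndex also has a -fix mode that drops corrupt segments (and the
# documents in them), so take a backup of the index directory first:
#   java -cp "$LUCENE_JAR" org.apache.lucene.index.CheckIndex "$INDEX_DIR" -fix
```

If the read-only check comes back clean, the build can simply be resumed from the last commit instead of starting over.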



What options would you recommend for the Sun JVM?

2009-09-25 Thread Jérôme Etévé
Hi solr addicts,

I know there's no one-size-fits-all set of options for the Sun JVM,
but I think it'd be useful to everyone to share tips on using the
Sun JVM with Solr.

For instance, I recently figured out that setting the tenured
generation garbage collector to concurrent mark-and-sweep
(-XX:+UseConcMarkSweepGC) dramatically decreased the amount of
time Java hangs on tenured-generation garbage collection. With my
settings, old-generation collections went from big chunks of 1-2
seconds to multiple small slices of ~0.2 s.

As a result, the commits (hence the searcher drop/rebuild) are much
less painful from the application performance point of view.
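Flags like these are usually collected into JAVA_OPTS before starting the servlet container. The sketch below is one plausible starting point; the heap sizes and the extra CMS/GC-logging flags are assumptions to tune, not the poster's actual settings.

```shell
# Sketch: JVM options for a Solr instance (values are assumptions; tune
# -Xms/-Xmx to your machine and watch the GC log before changing more).
JAVA_OPTS="-server -Xms1g -Xmx1g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:+CMSParallelRemarkEnabled \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
export JAVA_OPTS
# e.g. with the example Jetty distribution:
#   java $JAVA_OPTS -jar start.jar
```

The GC-logging flags cost little and make it easy to confirm, as above, whether pause times actually improved.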

What are the other options you would recommend?

Cheers!

Jerome.

-- 
Jerome Eteve.
http://www.eteve.net
jer...@eteve.net


Re: Highlighting on text fields

2009-09-25 Thread Avlesh Singh
I got the answer to my question.
The field needs to be "stored" (or have "termVectors" enabled) for highlighting to
work properly.
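Concretely, the field declaration in schema.xml would carry those attributes. A sketch (the field name is taken from the question below; only `stored` is strictly required, the term vector attributes mainly speed highlighting up):

```xml
<field name="text_entity_name" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Remember to re-index after changing the field definition, since stored values and term vectors only exist for documents indexed with them.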

Cheers
Avlesh

On Fri, Sep 25, 2009 at 1:01 PM, Avlesh Singh  wrote:

> I am new to the whole highlighting API and have a few basic questions:
> I have a "text" type field defined as underneath:
> 
> 
> 
>  ignoreCase="true" expand="false"/>
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
> 
>  ignoreCase="true" expand="true"/>
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
>
> And the schema field is associated as follows:
> 
>
> My query, q=text_entity_name:(foo bar)&hl=true&hl.fl=text_entity_name
> works fine for the search part but not for highlighting. The highlight named
> list is empty for each document returned back.
>
> I have a unique key defined. What am I missing? Do I need to store term
> vectors for highlighting to work properly?
>
> Cheers
> Avlesh
>


Using two Solr documents to represent one logical document/file

2009-09-25 Thread Peter Ledbrook

Hi,

I want to index both the contents of a document/file and metadata associated
with that document. Since I also want to update the content and metadata
indexes independently, I believe that I need to use two separate Solr
documents per real/logical document. The question I have is how do I merge
query results so that only one result is returned per real/logical document,
not per Solr document? In particular, I don't want to filter the results to
satisfy any "max results" constraint.

I have read that this can be achieved with a facet search. Is this the best
approach, or is there some alternative?

Thanks,

Peter
-- 
View this message in context: 
http://www.nabble.com/Using-two-Solr-documents-to-represent-one-logical-document-file-tp25609646p25609646.html
Sent from the Solr - User mailing list archive at Nabble.com.



problem with HTMLStripStandardTokenizerFactory

2009-09-25 Thread Kundig, Andreas
Hello

I can't get HTMLStripStandardTokenizerFactory to remove the content of the 
style tag, as the documentation says it should.

A search for 'mso' returns a document where the search term only appears in the 
style tag (it's a Word document saved as HTML). Here is the highlight returned 
by Solr (by the way: the wrong word is highlighted).

"vetica;
\n\tpanose-1:2 11 5 4 2 2 2 2 2 
4;
\n\tmso-font-charset:0;
\n\tmso-generic-font-family:swiss;
"

I am using solr 1.3. Here is how I configured the tokenizer in schema.xml


  






  
  







  


Am I doing something wrong?
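For reference, since the archive stripped the XML tags from the config above, a fieldType using this tokenizer is normally declared along these lines (a sketch; the filter chain here is an assumption, not the poster's actual config):

```xml
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- strips HTML markup before tokenizing -->
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two things worth checking: the stripping happens at analysis (index) time only, so if the field is stored, the raw HTML, including any `<style>` content, is still what the highlighter works from; and a re-index is needed after changing the analyzer, since documents indexed under the old config keep their old tokens.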

thank you
Andréas Kündig



Re: Showcase: Facetted Search for Wine using Solr

2009-09-25 Thread Marian Steinbach
Hi Grant!

Thanks for the advice; I added the link to the list.

Regards,

Marian


On Fri, Sep 25, 2009 at 5:14 AM, Grant Ingersoll  wrote:
> Hi Marian,
>
> Looks great!  Wish I could order some wine.  When you get a chance, please
> add the site to http://wiki.apache.org/solr/PublicServers!
>
> Cheers,
> Grant
>
> On Sep 24, 2009, at 11:51 AM, marian.steinbach wrote:
>
>> Hello everybody!
>>
>> The purpose of this mail is to say "thank you" to the creators of Solr
>> and to the community that supports it.
>>
>> We released our first project using Solr several weeks ago, after
>> having tested Solr for several months.
>>
>> The project I'm talking about is a product search for an online wine
>> shop (sorry, german user interface only):
>>
>>  http://www.koelner-weinkeller.de/index.php?id=sortiment
>>
>> Our client offers about 3000 different wines and other related products.
>>
>> Before we introduced Solr, products were searched via complicated
>> and slow SQL statements, with all kinds of problems related to
>> that: no full-text indexing, no stemming, etc.
>>
>> We are happy to make use of several built-in features which solve
>> problems that bugged us: faceted search, German accents, stemming,
>> and synonyms being the most important ones.
>>
>> The surrounding website is TYPO3-driven. We integrated Solr by
>> creating our own frontend plugin which talks to the Solr web service
>> (and we're very happy about the PHP output type!).
>>
>> I'd be glad about your comments.
>>
>> Cheers,
>>
>> Marian
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


Highlighting on text fields

2009-09-25 Thread Avlesh Singh
I am new to the whole highlighting API and have a few basic questions:
I have a "text" type field defined as underneath:





















And the schema field is associated as follows:


My query, q=text_entity_name:(foo bar)&hl=true&hl.fl=text_entity_name
works fine for the search part but not for highlighting. The highlight named
list is empty for each document returned back.

I have a unique key defined. What am I missing? Do I need to store term
vectors for highlighting to work properly?

Cheers
Avlesh


Re: Unsubscribe from this mailing-list

2009-09-25 Thread Avlesh Singh
You seem to be desperate to get out of the Solr mailing list :)
Send an email to solr-user-unsubscr...@lucene.apache.org

Cheers
Avlesh

On Fri, Sep 25, 2009 at 11:54 AM, Rafeek Raja  wrote:

> Unsubscribe from this mailing-list
>