Re: Use cases for ReplicationHandler's backup facility?
2009/9/24 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com:

On Fri, Sep 25, 2009 at 4:57 AM, Chris Harris rygu...@gmail.com wrote:

The ReplicationHandler (http://wiki.apache.org/solr/SolrReplication) has support for backups, which can be triggered in one of two ways:

1. in response to startup/commit/optimize events (specified through the backupAfter tag in the handler's requestHandler tag in solrconfig.xml)
2. by manually hitting http://master_host:port/solr/replication?command=backup

These backups get placed in directories named, e.g., snapshot.20090924033521, inside the Solr data directory. According to the docs, these backups are not necessary for replication to work. My question is: what use case *are* they meant to address?

The first potential use case that came to mind was that maybe I would be able to restore my index from these snapshot directories should it ever become corrupted. (I could just do something like rm -r data; mv snapshot.20090924033521 data.) That appears not to be one of the intended use cases, though; if it were, then I imagine the snapshot directories would contain the entire index, whereas they seem to contain only deltas of one form or another.

Yes, the only reason to take a backup should be for restoration/archival. They should contain all the files required for the latest commit point.

To be clear, you'd have to write your own code to make any kind of restore from these snapshot backup directories possible, right? (That is, the handler itself doesn't implement any kind of restore, nor can you restore by using simple filesystem commands like cp -r or mv.) For example, the most straightforward case would be if you limited yourself to only doing backups after each optimize; that's straightforward in that each snapshot directory should contain all the segment files required for a particular point-in-time view of the index.
However, it still wouldn't contain the Lucene segments_N file, and it seems that to implement an index restore you'd have to reconstitute that somehow.
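Since the snapshot directory names embed a timestamp (snapshot.YYYYMMDDHHMMSS), picking the newest one for an archival copy can be done with a plain directory listing. This is only a sketch of that selection step, not a restore implementation, which, per the thread, you would have to write yourself:

```python
# Sketch: find the newest snapshot.* directory under the Solr data dir.
# Timestamped names sort lexically, so max() picks the latest backup.
import os

def latest_snapshot(data_dir):
    snaps = [d for d in os.listdir(data_dir) if d.startswith("snapshot.")]
    return os.path.join(data_dir, max(snaps)) if snaps else None
```

Copying that directory elsewhere gives you the archival copy; turning it back into a live index is the part left to your own code.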
Unsubscribe from this mailing-list
Re: Solr http post performance seems slow - help?
This may or may not help, but here goes :) When I was running performance tests, I took a look at the simple post tool that comes with the Solr examples. First I changed my schema.xml to fit my needs, and then I deleted the old index so Solr created a blank one when I started up. Then I had a process chew on my data and spit out XML files that are formatted similarly to the XML files that the SimplePostTool example uses. Next I used the simple post tool to post the XML files to Solr (60k-80k records per XML file). Each file only took a couple of minutes to index this way. Commit and optimize after that (took less than 10 minutes), and after about 2.5 hrs I had indexed just under 8 million records. This was on a 4-year-old single-core laptop using Resin 3 as my servlet container. Hope this helps.

On Fri, Sep 25, 2009 at 3:51 AM, Lance Norskog goks...@gmail.com wrote:

In top, press the '1' key. This will give a list of the CPUs and how much load is on each. The display is otherwise a little weird for multi-CPU machines. But don't be surprised when Solr is I/O bound. The biggest, fanciest RAID is often a better investment than CPUs. On one project we bought low-end rack servers that come with 6-8 disk bays, filling them with 10k/15k RPM disks.

On Wed, Sep 23, 2009 at 2:47 PM, Dan A. Dickey dan.dic...@savvis.net wrote:

On Friday 11 September 2009 11:06:20 am Dan A. Dickey wrote: ... Our JBoss expert and I will be looking into why this might be occurring. Does anyone know of any JBoss-related slowness with Solr? And does anyone have any other sort of suggestions to speed indexing performance? Thanks for your help all! I'll keep you up to date with further progress.

Ok, further progress... just to keep any interested parties up to date and for the record... I'm finding that using the example Jetty setup (will be switching very soon to a real Jetty installation) is about the fastest.
Using several processes to send posts to Solr helps a lot, and we're seeing about 80 posts a second this way. We also stripped down JBoss to the bare bones, and the Solr in it is running nearly as fast - about 50 posts a second. It was our previous JBoss configuration that was making it appear slow for some reason. We will be running more tests and spreading out the pre-index workload across more machines and more processes. In our case we were seeing the bottleneck being one machine running 18 processes. The 2 quad-core Xeon system is experiencing about a 25% CPU load. And I'm not certain, but I think this may actually be 25% of one of the 8 cores. So, there's *lots* of room for Solr to be doing more work there. -Dan

-- Dan A. Dickey | Senior Software Engineer Savvis 10900 Hampshire Ave. S., Bloomington, MN 55438 Office: 952.852.4803 | Fax: 952.852.4951 E-mail: dan.dic...@savvis.net

-- Lance Norskog goks...@gmail.com
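The "chew on data and spit out XML" step above can be sketched as follows; the record fields are hypothetical, but the output matches the add/doc/field shape the example post tool consumes:

```python
# Sketch: turn a list of dict records into a Solr <add> XML batch,
# in the same shape as the SimplePostTool example files.
from xml.sax.saxutils import escape

def to_add_xml(records):
    docs = []
    for rec in records:
        fields = "".join(
            '<field name="%s">%s</field>' % (escape(name), escape(str(value)))
            for name, value in rec.items())
        docs.append("<doc>%s</doc>" % fields)
    return "<add>%s</add>" % "".join(docs)
```

Writing 60k-80k records per file, as described above, keeps each POST a reasonable size.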
Re: Unsubscribe from this mailing-list
You seem to be desperate to get out of the Solr mailing list :) Send an email to solr-user-unsubscr...@lucene.apache.org Cheers Avlesh On Fri, Sep 25, 2009 at 11:54 AM, Rafeek Raja rafeek.r...@gmail.com wrote: Unsubscribe from this mailing-list
Highlighting on text fields
I am new to the whole highlighting API and have a few basic questions. I have a text type field defined as underneath:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

And the schema field is associated as follows:

<field name="text_entity_name" type="text" indexed="true" stored="false"/>

My query, q=text_entity_name:(foo bar)&hl=true&hl.fl=text_entity_name, works fine for the search part but not for highlighting. The highlight named list is empty for each document returned back. I have a unique key defined. What am I missing? Do I need to store term vectors for highlighting to work properly? Cheers Avlesh
Re: Showcase: Facetted Search for Wine using Solr
Hi Grant! Thanks for the advice, I added the link to the list. Regards, Marian

On Fri, Sep 25, 2009 at 5:14 AM, Grant Ingersoll gsing...@apache.org wrote:

Hi Marian, Looks great! Wish I could order some wine. When you get a chance, please add the site to http://wiki.apache.org/solr/PublicServers! Cheers, Grant

On Sep 24, 2009, at 11:51 AM, marian.steinbach wrote:

Hello everybody! The purpose of this mail is to say thank you to the creators of Solr and to the community that supports it. We released our first project using Solr several weeks ago, after having tested Solr for several months. The project I'm talking about is a product search for an online wine shop (sorry, German user interface only): http://www.koelner-weinkeller.de/index.php?id=sortiment

Our client offers about 3000 different wines and other related products. Before we introduced Solr, the products were searched via complicated and slow SQL statements, with all kinds of problems related to that. No full-text indexing, no stemming, etc. We are happy to make use of several built-in features which solve problems that bugged us: faceted search, German accents and stemming, and synonyms being the most important ones. The surrounding website is TYPO3-driven. We integrated Solr by creating our own frontend plugin which talks to the Solr web service (and we're very happy about the PHP output type!). I'd be glad about your comments. Cheers, Marian

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Using two Solr documents to represent one logical document/file
Hi, I want to index both the contents of a document/file and metadata associated with that document. Since I also want to update the content and metadata indexes independently, I believe that I need to use two separate Solr documents per real/logical document. The question I have is how do I merge query results so that only one result is returned per real/logical document, not per Solr document? In particular, I don't want to filter the results to satisfy any max results constraint. I have read that this can be achieved with a facet search. Is this the best approach, or is there some alternative? Thanks, Peter -- View this message in context: http://www.nabble.com/Using-two-Solr-documents-to-represent-one-logical-document-file-tp25609646p25609646.html Sent from the Solr - User mailing list archive at Nabble.com.
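One way to sketch the facet-based approach asked about above, assuming the two Solr documents share a common key field (here called "logical_id", a hypothetical name not taken from the thread): facet on that field and treat each facet value as one logical result.

```python
# Sketch only: builds the query parameters for a facet-based collapse.
# The field name "logical_id", the query, and the endpoint are assumptions.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

params = urlencode({
    "q": "content:solr",          # example query
    "rows": "0",                  # only the facet counts are needed
    "facet": "true",
    "facet.field": "logical_id",  # shared key of the content and metadata docs
    "facet.mincount": "1",
})
query_url = "http://localhost:8983/solr/select?" + params
```

Each returned facet value then identifies one real/logical document, regardless of whether its content doc, its metadata doc, or both matched.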
What options would you recommend for the Sun JVM?
Hi Solr addicts, I know there's no one-size-fits-all set of options for the Sun JVM, but I think it'd be useful to everyone to share your tips on using the Sun JVM with Solr. For instance, I recently figured out that setting the tenured-generation garbage collection to concurrent mark and sweep (-XX:+UseConcMarkSweepGC) has dramatically decreased the amount of time Java hangs on tenured-gen garbage collecting. On my setup, the old-gen garbage collection went from big time chunks of 1~2 seconds to multiple small slices of ~0.2 s. As a result, the commits (hence the searcher drop/rebuild) are much less painful from the application performance point of view. What are the other options you would recommend? Cheers! Jerome. -- Jerome Eteve. http://www.eteve.net jer...@eteve.net
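For reference, the flag discussed above assembled into a full command line; the heap sizes, GC-logging flags, and start.jar path are placeholders for your own setup, not recommendations from the thread:

```python
# Sketch: build the java argv for a Solr start with the CMS collector.
def solr_jvm_argv(heap_gb=4):
    return [
        "java",
        "-Xms%dg" % heap_gb,
        "-Xmx%dg" % heap_gb,
        "-XX:+UseConcMarkSweepGC",  # concurrent mark-sweep for the tenured gen
        "-verbose:gc",              # log GC activity to verify the effect
        "-XX:+PrintGCDetails",
        "-jar", "start.jar",
    ]
```

Logging the GC activity alongside the collector switch is what lets you measure the pause-time change Jerome describes.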
DIH RSS 1.4 nightly 2009-09-25: full-import&clean=false always cleans, and the import command does nothing
Hello everybody, we are using Solr to index some RSS feeds for a news aggregator application. We've got some difficulties with the publication date of each item because each site uses a homemade date format. The fact is that we want to have the exact amount of time between the date of publication and the time it is now. So we decided to use a timestamp that stores the index time for each item. The problem is:

* when I do a full-import&clean=false, the index is always cleaned.
* when I do a simple import, nothing seems to be done.

Here is the configuration:

* Apache Solr 1.4 nightly 2009-09-25
* Java version: build 1.6.0_15-b03
* Java HotSpot Client VM: build 14.1-b02, mixed mode, sharing

= data-config.xml

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="flux_367" pk="link"
            url="http://www.capital.fr/rss2/feed/fil-bourse.xml"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="DateFormatTransformer, TemplateTransformer"
            onError="continue">
      <field column="source" template="368" commonField="true" />
      <field column="type" template="0" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
      <field column="date" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM HH:mm:ss z" />
    </entity>
  </document>
</dataConfig>

= schema.xml

[...]
<fields>
  <field name="source" type="text" indexed="true" stored="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="link" type="string" indexed="true" stored="true" />
  <field name="description" type="html" indexed="true" stored="true" />
  <field name="date" type="date" indexed="true" stored="true" default="NOW" />
  <field name="type" type="sint" indexed="true" stored="true" />
  <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
  <copyField source="source" dest="all_text" />
  <copyField source="title" dest="all_text" />
  <copyField source="description" dest="all_text" />
  <copyField source="date" dest="all_text" />
  <copyField source="type" dest="all_text" />
  <!-- Here, default is used to create a timestamp field indicating
       when each document was indexed. -->
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
</fields>
<uniqueKey>link</uniqueKey>
<defaultSearchField>all_text</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>

[...]

Tests:

= command=full-import&clean=false

25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=6
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.update.DirectUpdateHandler2 deleteAll INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=D:\srv\solr\index,segFN=segments_2s,version=1251453476028,generation=100,filenames=[segments_2s, _3u.cfs, _3u.cfx]
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1251453476028
25-Sep-2009 14:58:22 org.apache.solr.handler.dataimport.DocBuilder finish INFO: Import completed successfully

= command=import

25-Sep-2009 14:59:20 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/dataimport params={command=import} status=0 QTime=0
25-Sep-2009 14:59:20 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties

Any idea or suggestion? Thank you in advance!

-- Brahim Abdesslam Director of Operations * Maecia - Web development * Mob: +33 (0)6 82 87 31 27 Tel: +33 (0)9 54 99 29 59 Fax: +33 (0)9 59 99 29 59 http://www.maecia.com
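One thing worth checking: the log above shows params={command=full-import} with no clean parameter at all, so the &clean=false may never have reached Solr (an unquoted & being eaten by the shell, for example). A sketch of building the request URL so the parameter survives; host and port are placeholders:

```python
# Sketch: build the DIH full-import URL with clean=false explicitly encoded.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

params = urlencode({"command": "full-import", "clean": "false", "commit": "true"})
url = "http://localhost:8983/solr/dataimport?" + params
```

If the server log then shows clean=false among the request params, the earlier deletes were a URL problem rather than a DIH one.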
RE: Alphanumeric Wild Card Search Question
Hi Ken, I am using the WordDelimiterFilterFactory. I thought I needed it because I thought that's what gave me the control over the options of how the words are split and indexed? I did try taking it out completely, but that didn't seem to help. I'll try the analysis tool today. There has got to be a simple solution for this, but it is sure eluding me. Thanks, Adrian

-Original Message- From: Ensdorf Ken [mailto:ensd...@zoominfo.com] Sent: Thursday, September 24, 2009 5:03 PM To: solr-user@lucene.apache.org Subject: RE: Alphanumeric Wild Card Search Question

Here's my question: I have some products that I want to allow people to search for with wild cards. For example, if my product is YBM354, I'd like for users to be able to search on YBM*, YBM3*, YBM35*, and for any of these searches to return that product. I've found that I can search for YBM* and get the product, just not the other combinations.

Are you using WordDelimiterFilterFactory? That would explain this behavior. If so, do you need it - for the queries you describe you don't need that kind of tokenization. Also, have you played with the analysis tool on the admin page? It is a great help in debugging things like this. -Ken
RE: Alphanumeric Wild Card Search Question
In case it helps, here's what I have currently, but I've been messing with different options:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnNumerics="0" preserveOriginal="1"/>

-Original Message- From: Carr, Adrian [mailto:adrian.c...@jtv.com] Sent: Friday, September 25, 2009 9:28 AM To: solr-user@lucene.apache.org Subject: RE: Alphanumeric Wild Card Search Question

Hi Ken, I am using the WordDelimiterFilterFactory. I thought I needed it because I thought that's what gave me the control over the options of how the words are split and indexed? I did try taking it out completely, but that didn't seem to help. I'll try the analysis tool today. There has got to be a simple solution for this, but it is sure eluding me. Thanks, Adrian

-Original Message- From: Ensdorf Ken [mailto:ensd...@zoominfo.com] Sent: Thursday, September 24, 2009 5:03 PM To: solr-user@lucene.apache.org Subject: RE: Alphanumeric Wild Card Search Question

Here's my question: I have some products that I want to allow people to search for with wild cards. For example, if my product is YBM354, I'd like for users to be able to search on YBM*, YBM3*, YBM35*, and for any of these searches to return that product. I've found that I can search for YBM* and get the product, just not the other combinations.

Are you using WordDelimiterFilterFactory? That would explain this behavior. If so, do you need it - for the queries you describe you don't need that kind of tokenization. Also, have you played with the analysis tool on the admin page? It is a great help in debugging things like this. -Ken
Re: OOM error during merge - index still ok?
On Fri, Sep 25, 2009 at 8:20 AM, Phillip Farber pfar...@umich.edu wrote:

Can I expect the index to be left in a usable state after an out-of-memory error during a merge, or is it most likely to be corrupt?

It should be in the state it was after the last successful commit. -Yonik http://www.lucidimagination.com

I'd really hate to have to start this index build again from square one. Thanks, Phil

---
Exception in thread "http-8080-Processor2505" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI TCP Connection(131)-141.213.128.155" java.lang.OutOfMemoryError: Java heap space
Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "http-8080-Processor2537" java.lang.OutOfMemoryError: Java heap space
Exception in thread "http-8080-Processor2483" Exception in thread "RMI Scheduler(0)" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
Caused by: java.lang.OutOfMemoryError: Java heap space
Exception in thread "Lucene Merge Thread #266" org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot merge
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot merge
    at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4529)
    at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4512)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4424)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
Re: Can we point a Solr server to index directory dynamically at runtime..
Are you storing (in addition to indexing) your data? Perhaps you could turn off storage on data older than 7 days (requires reindexing), thus losing the ability to return snippets but cutting down on your storage space and server count. I've experienced a 10x decrease in space requirements and a large boost in speed after cutting extraneous storage from Solr -- the stored data is mixed in with the index data, and so it slows down searches. You could also put all 200G onto one Solr instance rather than 10 for the 7 days' data, and accept that those searches will be slower. Michael

On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer silentsurfe...@yahoo.com wrote:

Hi, Thank you Michael and Chris for the response. Today after the mail from Michael, we tested the dynamic loading of cores and it worked well. So we need to go with the hybrid approach of multicore and distributed searching.

As per our testing, we found that a Solr instance with 20 GB of index (single index or spread across multiple cores) can provide better performance when compared to a Solr instance with, say, 40 or 50 GB of index (single index or index spread across cores). So the 200 GB of index on day 1 will be spread across 200/20 = 10 Solr slave instances. On day 2, 10 more Solr slave servers are required; cumulative Solr slave instances = 200*2/20 = 20 ... On day 30, 10 more Solr slave servers are required; cumulative Solr slave instances = 200*30/20 = 300.

So with the above approach, we may need ~300 Solr slave instances, which becomes very unmanageable. But we know that most of the queries are for the past 1 week, i.e. we definitely need 70 Solr slaves containing the last 7 days' worth of data up and running. Now for the rest of the 230 Solr instances, do we need to keep them running for the odd query that can span across the 30 days of data (30*200 GB = 6 TB of data), which can come up only a couple of times a day?
This linear increase of Solr servers with the retention period doesn't seem to be a very scalable solution, so we are looking for a simpler approach to handle this scenario. Appreciate any further inputs/suggestions. Regards, sS

--- On Fri, 9/25/09, Chris Hostetter hossman_luc...@fucit.org wrote: From: Chris Hostetter hossman_luc...@fucit.org Subject: Re: Can we point a Solr server to index directory dynamically at runtime.. To: solr-user@lucene.apache.org Date: Friday, September 25, 2009, 4:04 AM

: Using a multicore approach, you could send a create a core named
: 'core3weeksold' pointing to '/datadirs/3weeksold' command to a live Solr,
: which would spin it up on the fly. Then you query it, and maybe keep it
: spun up until it's not queried for 60 seconds or something, then send a
: remove core 'core3weeksold' command.
: See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler .

Something that seems implicit in the question is what to do when the request spans all of the data ... this is where (in theory) distributed searching could help you out. Index each day's worth of data into its own core; that makes it really easy to expire the old data (just UNLOAD and delete an entire core once it's more than 30 days old). If your user is only searching current data then your app can directly query the core containing the most current data -- but if they want to query the last week, or last two weeks' worth of data, you do a distributed request for all of the shards needed to search the appropriate amount of data. Between the ALIAS and SWAP commands on the CoreAdmin screen it should be pretty easy to have cores with names like today, 1dayold, 2dayold, so that your app can configure simple shard params for all the permutations you'll need to query. -Hoss
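The create/query/unload cycle suggested above can be sketched as plain CoreAdmin URLs; host, port, core name, and data directory are the placeholders from the quoted example, not values prescribed by the thread:

```python
# Sketch: CoreAdmin requests for spinning a dated core up and down.
base = "http://localhost:8983/solr/admin/cores"

create_url = base + "?action=CREATE&name=core3weeksold&instanceDir=/datadirs/3weeksold"
status_url = base + "?action=STATUS&core=core3weeksold"
unload_url = base + "?action=UNLOAD&core=core3weeksold"
```

A distributed request over several such cores would then list them in the shards parameter of a normal query.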
Re: Parallel requests to Tomcat
Thank you Grant and Lance for your comments -- I've run into a separate snag which puts this on hold for a bit, but I'll return to finish digging into this and post my results. - Michael

On Thu, Sep 24, 2009 at 9:23 PM, Lance Norskog goks...@gmail.com wrote:

Are you on Java 5, 6 or 7? Each release sees some tweaking of the Java multithreading model as well as performance improvements (and bug fixes) in the Sun HotSpot runtime. You may be tripping over the TCP/IP multithreaded connection manager. You might wish to create each client thread with a separate socket.

Also, here is a standard bit of benchmarking advice: include think time. This means that instead of sending requests constantly, each thread should time out for a few seconds before sending the next request. This simulates a user stopping and thinking before clicking the mouse again. This helps simulate the quantity of threads, etc. which are stopped and waiting at each stage of the request pipeline. As it is, you are trying to simulate the throughput behaviour without simulating the horizontal volume. (Benchmarking is much harder than it looks.)

On Wed, Sep 23, 2009 at 9:43 AM, Grant Ingersoll gsing...@apache.org wrote:

On Sep 23, 2009, at 12:09 PM, Michael wrote:

On Wed, Sep 23, 2009 at 12:05 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Wed, Sep 23, 2009 at 11:47 AM, Michael solrco...@gmail.com wrote:

If this were IO bound, wouldn't I see the same results when sending my 8 requests to 8 Tomcats? There's only one disk (well, RAM) whether I'm querying 8 processes or 8 threads in 1 process, right?

Right - I was thinking IO bound at the Lucene Directory level - which synchronized in the past and led to poor concurrency. But your Solr version is recent enough to use the newer unsynchronized method by default (on non-Windows).

Ah, OK. So it looks like comparing to Jetty is my only next step.
Although I'm not sure what I'm going to do based on the result of that test -- if Jetty behaves differently, then I still don't know why the heck Tomcat is behaving badly! :) Have you done any profiling to see where hotspots are? Have you looked at garbage collection? Do you have any full collections occurring? What garbage collector are you using? How often are you updating/committing, etc? -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- Lance Norskog goks...@gmail.com
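Lance's think-time advice can be sketched as a small load-generator loop; the request function and the timing bounds are placeholders:

```python
# Sketch: each simulated user pauses between requests instead of
# hammering the server back-to-back, per the benchmarking advice above.
import random
import time

def user_session(send_request, n_requests, think_min=1.0, think_max=3.0):
    """Fire n_requests, sleeping a random 'think time' between each."""
    for _ in range(n_requests):
        send_request()
        time.sleep(random.uniform(think_min, think_max))
```

Running many such sessions in parallel threads gives the mix of active and waiting requests a real user population produces, rather than pure back-to-back throughput.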
RE: Mixed field types and boolean searching
No - there are various analyzers. StandardAnalyzer is geared toward searching bodies of text for interesting words - punctuation is ripped out. Other analyzers are more useful for concrete text. You may have to work at finding one that leaves punctuation in.

My problem is not with the StandardAnalyzer per se, but more with how dismax-style queries are handled by the query parser when the different fields have different sets of ignored tokens or stop words. Say you want to use the contents of a text box in your app to query a field in Solr. The user enters A and B, so you map this to f1:A AND f1:B. Now, if B is an ignored token in the f1 field for whatever reason, the query boils down to f1:A. Now imagine you want to allow the user's text to match multiple fields - as in, any term can match any field, but all terms must match at least one field. So now you map the user's query to (f1:A OR f2:A) AND (f1:B OR f2:B). But if f2 does not ignore B, the query boils down to (f1:A OR f2:A) AND (f2:B). Now documents that could come back when you were only matching against the f1 field don't come back. This seems counter-intuitive - to be consistent, I would think the query should essentially be treated as (f1:A OR f2:A) AND (TRUE OR f2:B), and thus a term that is a stop word or ignored token for any of the fields would be ignored across the board. So I guess what I'm asking is if there is a reason for the existing behavior, or is it just a fact of life of the query parser? Thanks! -Ken
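The asymmetry described above can be made concrete with a tiny sketch: build the cross-field query, dropping a term's clause only for the fields that stop it. The field names and stopword sets are illustrative:

```python
# Sketch: reproduce the per-field stop-word behavior described above.
STOPWORDS = {"f1": {"b"}, "f2": set()}  # "b" is ignored by f1 only

def cross_field_query(terms, fields):
    clauses = []
    for term in terms:
        alts = ["%s:%s" % (f, term) for f in fields if term not in STOPWORDS[f]]
        if alts:  # a term stopped by *every* field disappears entirely
            clauses.append("(" + " OR ".join(alts) + ")")
    return " AND ".join(clauses)

# cross_field_query(["a", "b"], ["f1", "f2"]) yields
# "(f1:a OR f2:a) AND (f2:b)" -- exactly the surprising form above
```

The consistent behavior Ken argues for would instead drop the whole second clause whenever any field stops the term.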
Re: Faceted Search on Dynamic Fields?
Also, here is the field definition in the schema:

<dynamicField name="*&amp;STRING_NOT_ANALYZED_YES" type="string" indexed="true" stored="true" multiValued="true"/>

-- View this message in context: http://www.nabble.com/Faceted-Search-on-Dynamic-Fields--tp25612887p25612936.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr and Garbage Collection
Hi, Have you looked at tuning the garbage collection? Take a look at the following articles:

http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

Changing to the concurrent or throughput collector should help with the long pauses. Colin.

-Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: Friday, September 25, 2009 11:37 AM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: Re: Solr and Garbage Collection

Right, now I'm giving it 12GB of heap memory. If I give it less (10GB) it throws the following exception:

Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
    at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352)
    at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:267)
    at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185)
    at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:207)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:70)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

On Fri, Sep 25, 2009 at 10:55 AM, Yonik Seeley yo...@lucidimagination.com wrote:

On Fri, Sep 25, 2009 at 9:30 AM, Jonathan Ariel ionat...@gmail.com wrote:

Hi to all! Lately my solr servers seem to stop responding once in a while. I'm using solr 1.3. Of course I'm having more traffic on the servers. So I logged the Garbage Collection activity to check if it's because of that.
It seems like 11% of the time the application runs, it is stopped because of GC. And sometimes the GC takes up to 10 seconds! Is this normal? My instances run on 16GB RAM, dual quad-core Intel Xeon servers. My index is around 10GB and I'm giving the instances 10GB of RAM.

Bigger heaps lead to bigger GC pauses in general. Do you mean that you are giving the JVM a 10GB heap? Were you getting OOM exceptions with a smaller heap? -Yonik http://www.lucidimagination.com
RE: Solr and Garbage Collection
Give it even more memory. The Lucene FieldCache stores (document ID, field value) pairs for non-tokenized, single-valued, non-boolean fields, and it is loaded in full - for instance, for sorting query results. So if you have 100,000,000 documents and a heavily distributed field (high cardinality, values around 100 bytes each), you need 10,000,000,000 bytes for just this one instance of FieldCache. GC does not play any role here: the FieldCache won't be GC-collected. -Fuad http://www.linkedin.com/in/liferay -Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: September-25-09 11:37 AM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: Re: Solr and Garbage Collection Right, now I'm giving it 12GB of heap memory. If I give it less (10GB) it throws the following exception: Sep 5, 2009 7:18:32 PM org.apache.solr.common.SolrException log SEVERE: java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72) at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352) at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:267) at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:185) at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:207) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:104) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:70) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139) at org.mortbay.jetty.Server.handle(Server.java:285) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
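Fuad's arithmetic above is easy to sanity-check with a back-of-the-envelope sketch. The document count and per-value size are his illustrative numbers, not measurements from any real index:

```java
// Back-of-the-envelope FieldCache sizing, using the figures from the
// message above (illustrative numbers, not measurements).
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long numDocs = 100000000L;   // documents in the index
        long bytesPerValue = 100L;   // rough size of one field value
        // One FieldCache entry per document for a single sorted/faceted field.
        long cacheBytes = numDocs * bytesPerValue;
        System.out.println(cacheBytes + " bytes, ~" + cacheBytes / 1000000000L + " GB");
        // prints: 10000000000 bytes, ~10 GB
    }
}
```

Every additional field you sort or facet on adds another cache of roughly the same shape, which is why the heap requirement climbs so quickly.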
RE: Solr and Garbage Collection
You are saying that I should give more memory than 12GB? Yes. Look at this: SEVERE: java.lang.OutOfMemoryError: Java heap space org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361) It can't find enough contiguous heap for createValue(...): it can't allocate the (field value, document ID) array. GC tuning won't help in this specific case... Maybe Solr/Lucene core developers could warm the FieldCache at IndexReader opening time in the future, so the OOM happens early... Avoiding faceting (and sorting) on such a field will only postpone the OOM to an unpredictable date/time... -Fuad http://www.linkedin.com/in/liferay
Re: Solr and Garbage Collection
It won't really - it will just keep the JVM from wasting time resizing the heap on you. Since you know you need so much RAM anyway, no reason not to just pin it at what you need. Not going to help you much with GC though. Jonathan Ariel wrote: BTW why making them equal will lower the frequency of GC? On 9/25/09, Fuad Efendi f...@efendi.ca wrote: Bigger heaps lead to bigger GC pauses in general. Opposite viewpoint: 1sec GC happening once an hour is MUCH BETTER than 30ms GC once-per-second. To lower frequency of GC: -Xms4096m -Xmx4096m (make it equal!) Use -server option. -server option of JVM is 'native CPU code', I remember WebLogic 7 console with SUN JVM 1.3 not showing any GC (just horizontal line). -Fuad http://www.linkedin.com/in/liferay -- - Mark http://www.lucidimagination.com
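One way to see the effect Mark describes: with -Xms and -Xmx set equal, the heap is committed at its final size from startup, so Runtime.totalMemory() should already sit at (or very close to) Runtime.maxMemory(), and the JVM never spends time growing or shrinking the heap. A minimal check, runnable under any flags:

```java
// Prints the current vs. maximum heap size. With -Xms4096m -Xmx4096m the
// two numbers should start out (nearly) equal, i.e. the heap is pinned.
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("total=" + rt.totalMemory()
                + " max=" + rt.maxMemory()
                + " free=" + rt.freeMemory());
    }
}
```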
Re: Faceted Search on Dynamic Fields?
On Fri, Sep 25, 2009 at 12:19 PM, Avlesh Singh avl...@gmail.com wrote: Faceting, as of now, can only be done on definitive field names. To further clarify, the fields you can facet on can include those defined by dynamic fields. You just must specify the exact field name when you facet. dynamicField name=*&STRING_NOT_ANALYZED_YES type=string indexed=true stored=true multiValued=true/ Did you really mean for the ampersand to be in the dynamic field name? I'd advise against this, and it could be the source of your problems (escaping the ampersand in your request, etc). What is the exact facet request you are sending? -Yonik http://www.lucidimagination.com
Re: Solr and Garbage Collection
-server option of JVM is 'native CPU code', I remember WebLogic 7 console with SUN JVM 1.3 not showing any GC (just horizontal line). Not sure what that is all about either. -server and -client are just two different versions of HotSpot. The -server version is optimized for long-running applications: it starts slower, but over time it learns about your app and makes good throughput optimizations. The -client version gets up to speed more quickly and concentrates more on response time than throughput - better for desktop apps. -server is better for long-lived server apps, generally. -- - Mark http://www.lucidimagination.com
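For what it's worth, you can ask a running JVM which collectors it actually selected (the defaults differ between -client and -server) through the standard java.lang.management API, available since Java 5. A small sketch that lists the per-generation collectors and their accumulated pause statistics:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Arrays;

public class ShowCollectors {
    public static void main(String[] args) {
        // One MXBean per collector; a generational JVM typically reports one
        // for the young generation and one for the tenured generation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " pools=" + Arrays.toString(gc.getMemoryPoolNames())
                    + " collections=" + gc.getCollectionCount()
                    + " totalMs=" + gc.getCollectionTime());
        }
    }
}
```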
RE: Solr and Garbage Collection
30ms is not better or worse than 1s until you look at the service requirements. For many applications, it is worth dedicating 10% of your processing time to GC if that makes the worst-case pause short. On the other hand, my experience with the IBM JVM was that the maximum query rate was 2-3X better with the concurrent generational GC compared to any of their other GC algorithms, so we got the best throughput along with the shortest pauses.

Solr garbage generation (for queries) seems to have two major components: per-request garbage and cache evictions. With a generational collector, these two are handled by separate parts of the collector. Per-request garbage should completely fit in the short-term heap (nursery), so that it can be collected rapidly and returned to use for further requests. If the nursery is too small, the per-request allocations will be made in tenured space and sit there until the next major GC. Cache evictions are almost always in long-term storage (tenured space) because an LRU algorithm guarantees that the garbage will be old.

Check the growth rate of tenured space (under constant load, of course) while increasing the size of the nursery. That rate should drop when the nursery gets big enough, then not drop much further as it is increased more. After that, reduce the size of tenured space until major GCs start happening too often (a judgment call). A bigger tenured space means longer major GCs and thus longer pauses, so you don't want it oversized by too much.

Also check the hit rates of your caches. If the hit rate is low, say 20% or less, make that cache much bigger or set it to zero. Either one will reduce the number of cache evictions. If you have an HTTP cache in front of Solr, zero may be the right choice, since the HTTP cache is cherry-picking the easily cacheable requests.

Note that a commit nearly doubles the memory required, because you have two live Searcher objects with all their caches. Make sure you have headroom for a commit.
If you want to test the tenured space usage, you must test with real world queries. Those are the only way to get accurate cache eviction rates. wunder
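Walter's "check the growth rate of tenured space" step doesn't require parsing GC logs; the same java.lang.management API exposes per-pool usage. A sketch, with the caveat that old-generation pool names vary by JVM and collector, so the substring match below is an assumption (on Sun JVMs, jstat -gcutil <pid> samples the same numbers from outside the process):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class TenuredWatch {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Pool names differ by JVM and collector ("Tenured Gen",
            // "CMS Old Gen", "PS Old Gen", ...), so match loosely.
            if (name.contains("Old") || name.contains("Tenured")) {
                MemoryUsage u = pool.getUsage();
                System.out.println(name + ": used=" + u.getUsed()
                        + " committed=" + u.getCommitted()
                        + " max=" + u.getMax());
            }
        }
    }
}
```

Sample this periodically under constant load; a steadily climbing "used" figure between major collections is the growth rate Walter describes.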
Re: download pre-release nightly solr 1.4
markrmiller wrote: michael8 wrote: Hi, I know Solr 1.4 is going to be released any day now pending the Lucene 2.9 release. Is there anywhere one can download a pre-release nightly build of Solr 1.4 just for getting familiar with new features (e.g. field collapsing)? Thanks, Michael You can download nightlies here: http://people.apache.org/builds/lucene/solr/nightly/ Field collapsing won't be in 1.4 though. You have to build from svn after applying the patch for that. -- - Mark http://www.lucidimagination.com Thanks for the info Mark. If field collapsing is a patch, can I apply the patch against 1.3 then? Thanks again. Michael -- View this message in context: http://www.nabble.com/download-pre-release-nightly-solr-1.4-tp25590281p25615553.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr and Garbage Collection
Walter Underwood wrote: 30ms is not better or worse than 1s until you look at the service requirements. For many applications, it is worth dedicating 10% of your processing time to GC if that makes the worst-case pause short. On the other hand, my experience with the IBM JVM was that the maximum query rate was 2-3X better with the concurrent generational GC compared to any of their other GC algorithms, so we got the best throughput along with the shortest pauses. With which collector? Since the very early JVMs, all GC is generational. Most of the collectors (other than the Serial Collector) also work concurrently. By default, they are concurrent on different generations, but you can add concurrency to the other generation with each now too. Solr garbage generation (for queries) seems to have two major components: per-request garbage and cache evictions. With a generational collector, these two are handled by separate parts of the collector. Different parts of the collector? It's a different collector depending on the generation. The young generation is collected with a copy collector. This is because almost all the objects in the young generation are likely dead, and a copy collector only needs to visit live objects. So it's very efficient. The tenured generation uses something more along the lines of mark and sweep or mark and compact. Per-request garbage should completely fit in the short-term heap (nursery), so that it can be collected rapidly and returned to use for further requests. If the nursery is too small, the per-request allocations will be made in tenured space and sit there until the next major GC. Cache evictions are almost always in long-term storage (tenured space) because an LRU algorithm guarantees that the garbage will be old. Check the growth rate of tenured space (under constant load, of course) while increasing the size of the nursery. That rate should drop when the nursery gets big enough, then not drop much further as it is increased more.
After that, reduce the size of tenured space until major GCs start happening too often (a judgment call). A bigger tenured space means longer major GCs and thus longer pauses, so you don't want it oversized by too much. With the concurrent low pause collector, the goal is to avoid major collections by collecting *before* the tenured space is filled. If you are getting major collections, you need to tune your settings - the whole point of that collector is to avoid major collections, and do almost all of the work while your application is not paused. There are still 2 brief pauses during the collection, but they should not be significant at all. -- - Mark http://www.lucidimagination.com
RE: Solr and Garbage Collection
As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low pause collector is only in the Sun JVM. I just found this excellent article about the various IBM GC options for a Lucene application with a 100GB heap: http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html wunder
Re: Solr and Garbage Collection
Ok. I will try with the concurrent low pause collector and let you know the results. On Fri, Sep 25, 2009 at 2:23 PM, Walter Underwood wun...@wunderwood.org wrote: As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low pause collector is only in the Sun JVM.
Re: Solr and Garbage Collection
My bad - later, it looks as if you're giving general advice, and that's what I took issue with. Any collector that is not doing generational collection is essentially from the dark ages and shouldn't be used. Any collector that doesn't have concurrent options - unless perhaps you're running a tiny app (under 100MB of RAM) or only have a single CPU - is also dark ages, and not fit for a server environment. I haven't kept up with IBM's JVM, but it sounds like they are well behind Sun in GC then. - Mark Walter Underwood wrote: As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low pause collector is only in the Sun JVM.
8 for 1.4
Y'all, We're down to 8 open issues: https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310230&versionId=12313351&showOpenIssuesOnly=true 2 are packaging related, one is dependent on the official 2.9 release (so it should be taken care of today or tomorrow, I suspect), and then we have a few others. The only somewhat major ones are S-1458, S-1294 (more on this in a mo') and S-1449. On S-1294, the SolrJS patch, I yet again have concerns about even including this, given the lack of activity (from Matthias, the original author, and others) and the fact that some in the Drupal community have already forked this to fix the various bugs in it instead of just submitting patches. While I really like the idea of this library (jQuery is awesome), I have yet to see interest in the community to maintain it (unless you count someone forking it and fixing the bugs in the fork as maintenance), and I'll be upfront in admitting I have neither the time nor the patience to debug Javascript across the gazillions of browsers out there (I don't even have IE on my machine, unless you count firing up a VM w/ XP on it) in the wild. Given what I know of most of the other committers here, I suspect that is true for others too. At a minimum, I think S-1294 should be pushed to 1.5. Next up, I think we consider pulling SolrJS from the release, but keeping it in trunk and officially releasing it with either 1.5 or 1.4.1, assuming it's gotten some love in the meantime. If by then it has no love, I vote we remove it and let the fork maintain it and point people there. -Grant
RE: Solr and Garbage Collection
For batch-oriented computing, like Hadoop, the most efficient GC is probably a non-concurrent, non-generational GC. I doubt that there are many batch-oriented applications of Solr, though. The rest of the advice is intended to be general and it sounds like we agree about sizing. If the nursery is not big enough, the tenured space will be used for allocations that have a short lifetime and that will increase the length and/or frequency of major collections. Cache evictions are the interesting part, because they cause a constant rate of tenured space garbage. In most many servers, you can get a big enough nursery that major collections are very rare. That won't happen in Solr because of cache evictions. The IBM JVM is excellent. Their concurrent generational GC policy is gencon. wunder -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, September 25, 2009 10:31 AM To: solr-user@lucene.apache.org Subject: Re: Solr and Garbage Collection My bad - later, it looks as if your giving general advice, and thats what I took issue with. Any Collector that is not doing generational collection is essentially from the dark ages and shouldn't be used. Any Collector that doesn't have concurrent options, unless possibly your running a tiny app (under 100MB of RAM), or only have a single CPU, is also dark ages, and not fit for a server environement. I havn't kept up with IBM's JVM, but it sounds like they are well behind Sun in GC then. - Mark Walter Underwood wrote: As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low pause collector is only in the Sun JVM. 
I just found this excellent article about the various IBM GC options for a Lucene application with a 100GB heap: http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html wunder

-Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, September 25, 2009 10:03 AM To: solr-user@lucene.apache.org Subject: Re: Solr and Garbage Collection

Walter Underwood wrote: 30ms is not better or worse than 1s until you look at the service requirements. For many applications, it is worth dedicating 10% of your processing time to GC if that makes the worst-case pause short. On the other hand, my experience with the IBM JVM was that the maximum query rate was 2-3X better with the concurrent generational GC compared to any of their other GC algorithms, so we got the best throughput along with the shortest pauses.

With which collector? Since the very early JVMs, all GC is generational. Most of the collectors (other than the Serial Collector) also work concurrently. By default, they are concurrent on different generations, but you can add concurrency to the other generation with each now too.

Solr garbage generation (for queries) seems to have two major components: per-request garbage and cache evictions. With a generational collector, these two are handled by separate parts of the collector.

Different parts of the collector? It's a different collector depending on the generation. The young generation is collected with a copy collector. This is because almost all the objects in the young generation are likely dead, and a copy collector only needs to visit live objects. So it's very efficient. The tenured generation uses something more along the lines of mark and sweep or mark and compact.

Per-request garbage should completely fit in the short-term heap (nursery), so that it can be collected rapidly and returned to use for further requests.
If the nursery is too small, the per-request allocations will be made in tenured space and sit there until the next major GC. Cache evictions are almost always in long-term storage (tenured space) because an LRU algorithm guarantees that the garbage will be old. Check the growth rate of tenured space (under constant load, of course) while increasing the size of the nursery. That rate should drop when the nursery gets big enough, then not drop much further as it is increased more. After that, reduce the size of tenured space until major GCs start happening too often (a judgment call). A bigger tenured space means longer major GCs and thus longer pauses, so you don't want it oversized by too much.

With the concurrent low pause collector, the goal is to avoid major collections, by collecting *before* the tenured space is filled. If you are getting major collections, you need to tune your settings - the whole point of that collector is to avoid major collections, and do almost all of the work while your application is not paused. There are still 2 brief pauses during the collection, but they should not be significant at all.

Also check the hit rates of your caches. If the hit rate is low, say 20% or less, make that cache much bigger or set it to zero.
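The tuning loop described above (fix the heap, grow the nursery until tenured-space growth flattens, then rely on the concurrent low pause collector to collect tenured space before it fills) maps onto concrete Sun JVM flags. A sketch only; the sizes and the `start.jar` launcher are illustrative assumptions, not recommendations for any particular index:

```shell
# -Xmn sets the nursery; grow it under constant load until the tenured
# growth rate stops dropping. CMS collects tenured space concurrently,
# starting before it fills (the occupancy fraction controls when).
java -Xms4g -Xmx4g -Xmn1g \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -verbose:gc -XX:+PrintGCDetails \
     -jar start.jar
```

`-verbose:gc -XX:+PrintGCDetails` is what lets you watch the tenured growth rate while you iterate on `-Xmn`.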
Solr + Jboss + Custom Transformers
Hi I am trying to use a custom transformer that extends org.apache.solr.handler.dataimport.Transformer. I have the CustomTransformer.jar and DataImportHandler.jar in JBOSS/server/default/lib. I have the solr.war (as is from the distro) in the JBOSS/server/default/deploy. org.apache.solr.handler.dataimport.EntityProcessorWrapper (line 110) returns false for the following code: clazz.newInstance() instanceof Transformer This happens because the CustomTransformer uses the Transformer from a different ClassLoader than the Solr web application. I could use the source code to create solr.war that includes the CustomTransformer class. Is there any other option - one that preferably does not include re-packaging solr.war? Thanks Papiya
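One commonly suggested way around this, without repackaging solr.war, is to move the plugin jar out of JBoss's shared server classpath and into the Solr home's lib directory, which Solr's own SolrResourceLoader scans and loads with the same classloader hierarchy as the webapp. A sketch with illustrative paths:

```shell
# Put the plugin where Solr's resource loader (not JBoss's server
# classloader) will pick it up. Paths are illustrative.
mkdir -p /opt/solr/home/lib
cp CustomTransformer.jar /opt/solr/home/lib/

# Then remove the jars from JBOSS/server/default/lib before restarting,
# so Transformer is only ever loaded through Solr's own classloader.
rm JBOSS/server/default/lib/CustomTransformer.jar
```

This assumes solr/home is set for the webapp (via JNDI or system property); the `instanceof` check then passes because both classes come from the same loader chain.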
Re: Solr and Garbage Collection
Walter Underwood wrote: For batch-oriented computing, like Hadoop, the most efficient GC is probably a non-concurrent, non-generational GC. Okay - for batch we somewhat agree I guess - if you can stand any length of pausing, non concurrent can be nice, because you don't pay for thread sync communication. Only with a small heap size though (less than 100MB is what I've seen). You would pause the batch job while GC takes place. If you have 8 processors, and you are pausing all of them to collect a large heap using only 1 processor, that doesn't make much sense to me. The thread communication pain will be far outweighed by using more processors to do the collection faster, and not stop the world for your batch job so long. Stopping your application dead in its tracks, and then only using one of the available processors to collect a large heap, while the rest sit idle, doesn't make much sense. I also don't agree it ever really makes sense not to do generational collection. What is your argument here? Generational collection is **way** more efficient for short lived objects, which tend to be up to 98% of the objects in most applications. The only way I see that making sense is if you have almost no short lived objects (which occurs in what, .0001% of apps if at all?). The Sun JVM doesn't even offer a non generational approach anymore. It's just standard GC practice. I doubt that there are many batch-oriented applications of Solr, though. The rest of the advice is intended to be general and it sounds like we agree about sizing. If the nursery is not big enough, the tenured space will be used for allocations that have a short lifetime and that will increase the length and/or frequency of major collections. Yes - I wasn't arguing with every point - I was picking and choosing :) After the heap size, the size of the young generation is the most important factor. Cache evictions are the interesting part, because they cause a constant rate of tenured space garbage. 
In many servers, you can get a big enough nursery that major collections are very rare. That won't happen in Solr because of cache evictions. The IBM JVM is excellent. Their concurrent generational GC policy is gencon.

Yeah, I actually know very little about the IBM JVM, so I wasn't really commenting. But from the info I gleaned here and on a couple quick web searches, I'm not too impressed by its GC. wunder

-Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, September 25, 2009 10:31 AM To: solr-user@lucene.apache.org Subject: Re: Solr and Garbage Collection

My bad - later, it looks as if you're giving general advice, and that's what I took issue with. Any collector that is not doing generational collection is essentially from the dark ages and shouldn't be used. Any collector that doesn't have concurrent options, unless possibly you're running a tiny app (under 100MB of RAM) or only have a single CPU, is also dark ages, and not fit for a server environment. I haven't kept up with IBM's JVM, but it sounds like they are well behind Sun in GC then. - Mark

Walter Underwood wrote: As I said, I was using the IBM JVM, not the Sun JVM. The concurrent low pause collector is only in the Sun JVM. I just found this excellent article about the various IBM GC options for a Lucene application with a 100GB heap: http://www.nearinfinity.com/blogs/aaron_mccurry/tuning_the_ibm_jvm_for_large_h.html wunder

-Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, September 25, 2009 10:03 AM To: solr-user@lucene.apache.org Subject: Re: Solr and Garbage Collection

Walter Underwood wrote: 30ms is not better or worse than 1s until you look at the service requirements. For many applications, it is worth dedicating 10% of your processing time to GC if that makes the worst-case pause short.
On the other hand, my experience with the IBM JVM was that the maximum query rate was 2-3X better with the concurrent generational GC compared to any of their other GC algorithms, so we got the best throughput along with the shortest pauses.

With which collector? Since the very early JVMs, all GC is generational. Most of the collectors (other than the Serial Collector) also work concurrently. By default, they are concurrent on different generations, but you can add concurrency to the other generation with each now too.

Solr garbage generation (for queries) seems to have two major components: per-request garbage and cache evictions. With a generational collector, these two are handled by separate parts of the collector.

Different parts of the collector? It's a different collector depending on the generation. The young generation is collected with a copy collector. This is because almost all the objects in the young generation are likely dead, and a copy collector only needs to visit live objects.
Re: FW: Solr and Garbage Collection
Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

Fuad Efendi wrote: Guys, thanks for GC discussion; but the root of the problem is FieldCache internals. Not enough RAM for FieldCache will cause unpredictable OOM, and it does not depend on GC. How much RAM does FieldCache need in case of 2 different values for a Field, 200 bytes each (Unicode), and 100M documents? What if we have 100 such non-tokenized fields in a schema? SOLR has an option to warm up caches on startup which might help troubleshooting. JRockit JVM has a 'realtime' version if you are interested in predictable GC (without delaying 'transaction')... GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM!

-Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: September-25-09 12:17 PM To: solr-user@lucene.apache.org Subject: RE: Solr and Garbage Collection

You are saying that I should give more memory than 12GB? Yes. Look at this: SEVERE: java.lang.OutOfMemoryError: Java heap space org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361) It can't find a few (!!!) contiguous bytes for .createValue(...) It can't add a (Field Value, Document ID) pair to an array. GC tuning won't help in this specific case... Maybe SOLR/Lucene core developers could warm FieldCache at IndexReader opening time, in the future... to have an early OOM... Avoiding faceting (and sorting) on such a field will only postpone OOM to an unpredictable date/time... -Fuad http://www.linkedin.com/in/liferay -- - Mark http://www.lucidimagination.com
FW: Solr and Garbage Collection
Guys, thanks for GC discussion; but the root of the problem is FieldCache internals. Not enough RAM for FieldCache will cause unpredictable OOM, and it does not depend on GC. How much RAM does FieldCache need in case of 2 different values for a Field, 200 bytes each (Unicode), and 100M documents? What if we have 100 such non-tokenized fields in a schema? SOLR has an option to warm up caches on startup which might help troubleshooting. JRockit JVM has a 'realtime' version if you are interested in predictable GC (without delaying 'transaction')... GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM!

-Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: September-25-09 12:17 PM To: solr-user@lucene.apache.org Subject: RE: Solr and Garbage Collection

You are saying that I should give more memory than 12GB? Yes. Look at this: SEVERE: java.lang.OutOfMemoryError: Java heap space org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:361) It can't find a few (!!!) contiguous bytes for .createValue(...) It can't add a (Field Value, Document ID) pair to an array. GC tuning won't help in this specific case... Maybe SOLR/Lucene core developers could warm FieldCache at IndexReader opening time, in the future... to have an early OOM... Avoiding faceting (and sorting) on such a field will only postpone OOM to an unpredictable date/time... -Fuad http://www.linkedin.com/in/liferay
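Fuad's sizing question can be roughed out with back-of-the-envelope arithmetic. A sketch, assuming the string FieldCache keeps one 4-byte int ord per document per field plus the unique terms (a simplification of the real Lucene layout; with only 2 unique values per field, the values themselves are noise by comparison):

```shell
# 100M documents, 100 non-tokenized fields, one 4-byte ord per doc/field.
DOCS=100000000
FIELDS=100
BYTES_PER_ORD=4
TOTAL=$(( DOCS * FIELDS * BYTES_PER_ORD ))
echo "ord arrays: $(( TOTAL / 1024 / 1024 / 1024 )) GB"
# prints: ord arrays: 37 GB
```

Under these assumptions the ord arrays alone dwarf a 12GB heap, which matches the observation that GC tuning can't fix this particular OOM.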
RE: FW: Solr and Garbage Collection
Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (compaction of live objects!!!) even if RAM is (theoretically) enough. -Fuad

Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad
Hierarchical Facet Field Prefix Not Working
Hello all, We are using the patch from SOLR-64 (http://issues.apache.org/jira/browse/SOLR-64) to implement hierarchical facets for categories. We are trying to use the facet.prefix to prevent all categories from coming back. However, f.category.facet.prefix doesn't work. Using facet.prefix works but prevents the other facets from coming back since it is a global option. Are per-facet options supported on hierarchical facet fields? If not, how can I get a specific category and its children without getting the surrounding categories? Any help is much appreciated. Thank you, Nasseam Elkarra http://bodukai.com/boutique/ The fastest possible shopping experience.
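For reference, in stock Solr the per-field override form is `f.<fieldname>.facet.prefix`; whether the SOLR-64 hierarchical field honors it is exactly the open question here. A minimal request to compare the two forms (host, field name, and prefix value are illustrative):

```shell
# Global prefix (reported to work, but applies to every facet field):
curl 'http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&facet.prefix=clothing/mens'

# Per-field prefix (should scope the prefix to the category field only):
curl 'http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&f.category.facet.prefix=clothing/mens'
```

If the second form returns unfiltered category values, that would point at the patch not reading per-field params rather than at the request syntax.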
Re: FW: Solr and Garbage Collection
I'm not planning on lowering the heap. I just want to lower the time wasted on GC, which is 11% right now. So what I'll try is changing the GC to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote: Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (compaction of live objects!!!) even if RAM is (theoretically) enough. -Fuad

Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad
RE: FW: Solr and Garbage Collection
But again, GC is not just Garbage Collection as many in this thread think... it is also memory defragmentation, which is much more costly than collection just because it needs to move _live_objects_ somewhere (and wait/lock till such objects get unlocked to be moved...) - obviously more memory helps... 11% is extremely high. -Fuad http://www.linkedin.com/in/liferay

-Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: September-25-09 3:36 PM To: solr-user@lucene.apache.org Subject: Re: FW: Solr and Garbage Collection

I'm not planning on lowering the heap. I just want to lower the time wasted on GC, which is 11% right now. So what I'll try is changing the GC to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote: Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (compaction of live objects!!!) even if RAM is (theoretically) enough. -Fuad

Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad
Re: FW: Solr and Garbage Collection
On Fri, Sep 25, 2009 at 2:52 PM, Fuad Efendi f...@efendi.ca wrote: Lowering heap helps GC? Yes. In general, lowering the heap can help or hurt. Hurt: if one is running very low on memory, GC will be working harder all of the time trying to find more memory and the % of time that GC takes can go up. Help: if one has massive heaps, full GCs may not happen as frequently, but when they do they can be larger and cause more of a problem. For many apps, a .2 second pause every minute is preferable to a 10 second pause every hour. And of course the other reason to lower the heap size *if* you don't need it that big is to leave more memory for other stuff, and for the OS itself to cache the index files. -Yonik http://www.lucidimagination.com
Re: FW: Solr and Garbage Collection
Maybe what's missing here is how I got the 11%. I just ran solr with the following JVM params: -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime With those I can measure the amount of time the application ran between collection pauses and the length of the collection pauses, respectively. I think that in this case the 11% is just for memory collection and not defragmentation... but I'm not 100% sure.

On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi f...@efendi.ca wrote: But again, GC is not just Garbage Collection as many in this thread think... it is also memory defragmentation, which is much more costly than collection just because it needs to move _live_objects_ somewhere (and wait/lock till such objects get unlocked to be moved...) - obviously more memory helps... 11% is extremely high. -Fuad http://www.linkedin.com/in/liferay

-Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: September-25-09 3:36 PM To: solr-user@lucene.apache.org Subject: Re: FW: Solr and Garbage Collection

I'm not planning on lowering the heap. I just want to lower the time wasted on GC, which is 11% right now. So what I'll try is changing the GC to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote: Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (compaction of live objects!!!) even if RAM is (theoretically) enough. -Fuad

Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad
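The 11% figure can be reproduced from the output of those two flags by summing the alternating "running" and "stopped" lines. A sketch; the log lines below are fabricated examples of the format the flags print, not real measurements:

```shell
# The two flags alternate lines like these in the GC log:
cat > gc.log <<'EOF'
Application time: 0.4000000 seconds
Total time for which application threads were stopped: 0.0500000 seconds
Application time: 0.5000000 seconds
Total time for which application threads were stopped: 0.0500000 seconds
EOF

# stopped / (running + stopped) = fraction of wall time lost to pauses
awk '/Application time/ {run += $3}
     /stopped/ {stop += $9}
     END {printf "GC overhead: %.0f%%\n", 100 * stop / (run + stop)}' gc.log
# prints: GC overhead: 10%
```

Note this measures total stop-the-world time, so it includes both minor and major collection pauses; it says nothing about how much of each pause was compaction.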
Re: Can we point a Solr server to index directory dynamically at runtime..
Hi Michael, We are storing all our data in addition to the index, as we need to display those values to the user. So unfortunately we cannot go with the option stored=false, which could have potentially solved our issue. Appreciate any other pointers/suggestions. Thanks, sS

--- On Fri, 9/25/09, Michael solrco...@gmail.com wrote: From: Michael solrco...@gmail.com Subject: Re: Can we point a Solr server to index directory dynamically at runtime.. To: solr-user@lucene.apache.org Date: Friday, September 25, 2009, 2:00 PM

Are you storing (in addition to indexing) your data? Perhaps you could turn off storage on data older than 7 days (requires reindexing), thus losing the ability to return snippets but cutting down on your storage space and server count. I've experienced a 10x decrease in space requirements and a large boost in speed after cutting extraneous storage from Solr -- the stored data is mixed in with the index data and so it slows down searches. You could also put all 200G onto one Solr instance rather than 10 for 7 days' data, and accept that those searches will be slower. Michael

On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer silentsurfe...@yahoo.com wrote: Hi, Thank you Michael and Chris for the response. Today after the mail from Michael, we tested the dynamic loading of cores and it worked well. So we need to go with the hybrid approach of multicore and distributed searching. As per our testing, we found that a Solr instance with 20 GB of index (single index or spread across multiple cores) can provide better performance when compared to a Solr instance with, say, 40 or 50 GB of index (single index or index spread across cores). So the 200 GB of index on day 1 will be spread across 200/20=10 Solr slave instances. On day 2 data, 10 more Solr slave servers are required; cumulative Solr slave instances = 200*2/20=20 ... .. ..
On day 30 data, 10 more Solr slave servers are required; cumulative Solr slave instances = 200*30/20=300. So with the above approach, we may need ~300 Solr slave instances, which becomes very unmanageable. But we know that most of the queries are for the past 1 week, i.e. we definitely need 70 Solr slaves containing the last 7 days' worth of data up and running. Now for the rest of the 230 Solr instances, do we need to keep them running for the odd query that can span across the 30 days of data (30*200 GB = 6 TB of data), which can come up only a couple of times a day? This linear increase of Solr servers with the retention period doesn't seem to be a very scalable solution. So we are looking for a simpler approach to handle this scenario. Appreciate any further inputs/suggestions. Regards, sS

--- On Fri, 9/25/09, Chris Hostetter hossman_luc...@fucit.org wrote: From: Chris Hostetter hossman_luc...@fucit.org Subject: Re: Can we point a Solr server to index directory dynamically at runtime.. To: solr-user@lucene.apache.org Date: Friday, September 25, 2009, 4:04 AM

: Using a multicore approach, you could send a create a core named : 'core3weeksold' pointing to '/datadirs/3weeksold' command to a live Solr, : which would spin it up on the fly. Then you query it, and maybe keep it : spun up until it's not queried for 60 seconds or something, then send a : remove core 'core3weeksold' command. : See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler .

something that seems implicit in the question is what to do when the request spans all of the data ... this is where (in theory) distributed searching could help you out.
index each day's worth of data into its own core; that makes it really easy to expire the old data (just UNLOAD and delete an entire core once it's more than 30 days old). if your user is only searching current data then your app can directly query the core containing the most current data -- but if they want to query the last week, or last two weeks' worth of data, you do a distributed request for all of the shards needed to search the appropriate amount of data. Between the ALIAS and SWAP commands on the CoreAdmin screen it should be pretty easy to have cores with names like today, 1dayold, 2dayold so that your app can configure simple shard params for all the permutations you'll need to query. -Hoss
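Hoss's per-day core lifecycle can be sketched as CoreAdmin HTTP calls. Host, port, and core/directory names below are illustrative; the actions themselves are from the CoreAdmin wiki page linked above:

```shell
# Spin up a core over an older day's data directory on demand:
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=3weeksold&instanceDir=/datadirs/3weeksold'

# Query several days at once with a distributed (sharded) request:
curl 'http://localhost:8983/solr/today/select?q=error&shards=localhost:8983/solr/today,localhost:8983/solr/1dayold,localhost:8983/solr/2dayold'

# Retire a core once it is no longer queried (then delete its data dir):
curl 'http://localhost:8983/solr/admin/cores?action=UNLOAD&core=3weeksold'
```

With SWAP (action=SWAP&core=1dayold&other=today) the rolling names can be advanced each night without the app changing its shard params.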
Re: FW: Solr and Garbage Collection
When we talk about Collectors, we are not just talking about collecting - whatever that means. There isn't really a collecting phase - the whole algorithm is garbage collecting - hence calling the different implementations collectors. Usually, fragmentation is dealt with using a mark-compact collector (or IBM has used a mark-sweep-compact collector). Copying collectors are not only super efficient at collecting young spaces, but they are also great for fragmentation - when you copy everything to the new space, you can remove any fragmentation. At the cost of double the space requirements though. So mark-compact is a compromise. First you mark what's reachable, then everything that's marked is copied/compacted to the bottom of the heap. It's all part of a collection though.

Jonathan Ariel wrote: Maybe what's missing here is how I got the 11%. I just ran solr with the following JVM params: -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime With those I can measure the amount of time the application ran between collection pauses and the length of the collection pauses, respectively. I think that in this case the 11% is just for memory collection and not defragmentation... but I'm not 100% sure.

On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi f...@efendi.ca wrote: But again, GC is not just Garbage Collection as many in this thread think... it is also memory defragmentation, which is much more costly than collection just because it needs to move _live_objects_ somewhere (and wait/lock till such objects get unlocked to be moved...) - obviously more memory helps... 11% is extremely high. -Fuad http://www.linkedin.com/in/liferay

-Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: September-25-09 3:36 PM To: solr-user@lucene.apache.org Subject: Re: FW: Solr and Garbage Collection

I'm not planning on lowering the heap.
I just want to lower the time wasted on GC, which is 11% right now. So what I'll try is changing the GC to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote: Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (compaction of live objects!!!) even if RAM is (theoretically) enough. -Fuad

Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad -- - Mark http://www.lucidimagination.com
solr home
I already have a handful of solr instances running. However, I'm trying to install solr (1.4) on a new linux server with tomcat using a context file (same way I usually do):

<Context docBase="/opt/local/solr/apache-solr-1.4.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/local/solr/fedora_solr/" override="true"/>
</Context>

However it throws an exception due to the following:

SEVERE: Could not start SOLR. Check solr/home property
java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath or 'solr/conf/', cwd=/opt/local/solr/fedora_solr
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:198)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:166)

Any ideas why this is happening? Thanks, Mike
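The exception itself gives a hint: the loader is searching 'solr/conf/' (the built-in default home, relative to the cwd) rather than the configured fedora_solr path, which suggests either a missing conf directory or a JNDI entry that never bound. Two quick checks; the Tomcat fragment location is an assumption about this setup:

```shell
# 1) The directory the Environment entry points at must contain conf/
#    with solrconfig.xml in it:
ls /opt/local/solr/fedora_solr/conf/
# expected to list: solrconfig.xml  schema.xml  ...

# 2) The <Context> fragment must live where Tomcat reads per-app context
#    files, e.g. $CATALINA_HOME/conf/Catalina/localhost/<appname>.xml.
#    If the solr/home JNDI entry never binds, Solr falls back to the
#    'solr/conf/' default seen in the stack trace.
ls $CATALINA_HOME/conf/Catalina/localhost/
```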
Re: Solr and Garbage Collection
Ok. I'll first change the GC and see if the time spent decreased. Then I'll try increasing the heap as Fuad recommends.

On 9/25/09, Mark Miller markrmil...@gmail.com wrote: When we talk about Collectors, we are not just talking about collecting - whatever that means. There isn't really a collecting phase - the whole algorithm is garbage collecting - hence calling the different implementations collectors. Usually, fragmentation is dealt with using a mark-compact collector (or IBM has used a mark-sweep-compact collector). Copying collectors are not only super efficient at collecting young spaces, but they are also great for fragmentation - when you copy everything to the new space, you can remove any fragmentation. At the cost of double the space requirements though. So mark-compact is a compromise. First you mark what's reachable, then everything that's marked is copied/compacted to the bottom of the heap. It's all part of a collection though.

Jonathan Ariel wrote: Maybe what's missing here is how I got the 11%. I just ran solr with the following JVM params: -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime With those I can measure the amount of time the application ran between collection pauses and the length of the collection pauses, respectively. I think that in this case the 11% is just for memory collection and not defragmentation... but I'm not 100% sure.

On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi f...@efendi.ca wrote: But again, GC is not just Garbage Collection as many in this thread think... it is also memory defragmentation, which is much more costly than collection just because it needs to move _live_objects_ somewhere (and wait/lock till such objects get unlocked to be moved...) - obviously more memory helps... 11% is extremely high.
-Fuad http://www.linkedin.com/in/liferay

-Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: September-25-09 3:36 PM To: solr-user@lucene.apache.org Subject: Re: FW: Solr and Garbage Collection

I'm not planning on lowering the heap. I just want to lower the time wasted on GC, which is 11% right now. So what I'll try is changing the GC to -XX:+UseConcMarkSweepGC

On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote: Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (compaction of live objects!!!) even if RAM is (theoretically) enough. -Fuad

Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark

GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad -- - Mark http://www.lucidimagination.com
Re: FW: Solr and Garbage Collection
or IBM has used a mark-sweep-compact collector Never mind - Sun's is also sometimes referred to as mark-sweep-compact. I've just seen it referred to as mark-compact before as well. In either case though, without some sort of sweep phase, there is no reclamation of memory :) It's interesting though - in the days of the early JVMs Sun talked more about compaction - but if you look at their recent info, they don't even mention it, or give you params to mess with it. They just talk about the mark and the sweep phase. IBM is much more open about a compaction phase, and not only do they give controls to tune it, they let you turn it off completely. Not sure what Sun is doing with compaction these days - or if they just work with fragmentation avoidance techniques instead - haven't seen any info on it.

Mark Miller wrote: When we talk about Collectors, we are not just talking about collecting - whatever that means. There isn't really a collecting phase - the whole algorithm is garbage collecting - hence calling the different implementations collectors. Usually, fragmentation is dealt with using a mark-compact collector (or IBM has used a mark-sweep-compact collector). Copying collectors are not only super efficient at collecting young spaces, but they are also great for fragmentation - when you copy everything to the new space, you can remove any fragmentation. At the cost of double the space requirements though. So mark-compact is a compromise. First you mark what's reachable, then everything that's marked is copied/compacted to the bottom of the heap. It's all part of a collection though.

Jonathan Ariel wrote: Maybe what's missing here is how I got the 11%. I just ran solr with the following JVM params: -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime With those I can measure the amount of time the application ran between collection pauses and the length of the collection pauses, respectively.
I think that in this case the 11% is just for memory collection and not defragmentation... but I'm not 100% sure. On Fri, Sep 25, 2009 at 5:05 PM, Fuad Efendi f...@efendi.ca wrote: But again, GC is not just Garbage Collection as many in this thread think... it is also memory defragmentation, which is much more costly than collection just because it needs to move _live_objects_ somewhere (and wait/lock till such objects get unlocked to be moved...) - obviously more memory helps... 11% is extremely high. -Fuad http://www.linkedin.com/in/liferay -Original Message- From: Jonathan Ariel [mailto:ionat...@gmail.com] Sent: September-25-09 3:36 PM To: solr-user@lucene.apache.org Subject: Re: FW: Solr and Garbage Collection I'm not planning on lowering the heap. I just want to lower the time wasted on GC, which is 11% right now. So what I'll try is changing the GC to -XX:+UseConcMarkSweepGC On Fri, Sep 25, 2009 at 4:17 PM, Fuad Efendi f...@efendi.ca wrote: Mark, what if a piece of code needs 10 contiguous KB to load a document field? How are locked memory pieces optimized/moved (putting almost the whole application on hold)? Lowering the heap is a _bad_ idea; we will have extremely frequent GC (optimization of live objects!!!) even if RAM is (theoretically) enough. -Fuad Fuad, you didn't read the thread right. He is not having a problem with OOM. He got the OOM because he lowered the heap to try and help GC. He normally runs with a heap that can handle his FC. Please re-read the thread. You are confusing the thread. - Mark GC will frequently happen even if RAM is more than enough: in case it is heavily sparse... so have even more RAM! -Fuad -- - Mark http://www.lucidimagination.com
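The 11% figure discussed above is just the ratio of stopped time to total wall-clock time as reported by those two flags. A minimal sketch of that arithmetic (the sample durations below are made up for illustration, not from Jonathan's actual logs):

```java
import java.util.List;

public class GcOverhead {
    // Fraction of wall-clock time the application was stopped, given the
    // durations (in seconds) reported by -XX:+PrintGCApplicationConcurrentTime
    // (time the app ran between pauses) and -XX:+PrintGCApplicationStoppedTime
    // (length of each pause).
    static double stoppedFraction(List<Double> concurrentSecs, List<Double> stoppedSecs) {
        double running = concurrentSecs.stream().mapToDouble(Double::doubleValue).sum();
        double paused = stoppedSecs.stream().mapToDouble(Double::doubleValue).sum();
        return paused / (running + paused);
    }

    public static void main(String[] args) {
        // Hypothetical sample: 89s running, 11s paused -> 11% GC overhead
        double f = stoppedFraction(List.of(40.0, 49.0), List.of(6.0, 5.0));
        System.out.println(f);
    }
}
```

In practice you would sum the durations parsed out of the GC log rather than hard-code them.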
Re: Solr and Garbage Collection
On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote: Hi to all! Lately my Solr servers seem to stop responding once in a while. I'm using Solr 1.3. Of course I'm having more traffic on the servers. So I logged the Garbage Collection activity to check if it's because of that. It seems like 11% of the time the application runs, it is stopped because of GC. And sometimes the GC takes up to 10 seconds! Is this normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon servers. My index is around 10GB and I'm giving the instances 10GB of RAM. How can I check which GC is being used? If I'm right, JVM Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have any recommendation on this? As I said in Eteve's thread on JVM settings, some extra time spent on application design/debugging will save a whole lot of headache in Garbage Collection and trying to tune the gazillion different options available. Ask yourself: What is on the heap, and does it need to be there? For instance, do you, if you have them, really need sortable ints? If your servers seem to come to a stop, I'm going to bet you have major collections going on. Major collections in a production system are very bad. They tend to happen right after commits in poorly tuned systems, but can also happen in other places if you let things build up due to really large heaps and/or things like really large cache settings. I would pull up jConsole and have a look at what is happening when the pauses occur. Is it a major collection? If so, then hook up a heap analyzer or a profiler and see what is on the heap around those times. Then have a look at your schema/config, etc. and see if there are things that are memory intensive (sorting, faceting, excessively large filter caches). -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
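One concrete way to answer "which GC is being used" without attaching jConsole is to ask the JVM's own management beans from inside the process. A small sketch (the bean names printed vary by JVM vendor and collector, e.g. "PS Scavenge"/"PS MarkSweep" for HotSpot's throughput collector):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class WhichGc {
    public static void main(String[] args) {
        // Each bean corresponds to one collector the running JVM is using;
        // the cumulative counts/times also show how much work major vs.
        // minor collections are doing.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Dropping something like this into a servlet or JMX client gives a quick sanity check that the collector you configured is the one actually running.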
RE: FW: Solr and Garbage Collection
Usually, fragmentation is dealt with using a mark-compact collector (or IBM has used a mark-sweep-compact collector). Copying collectors are not only super efficient at collecting young spaces, but they are also great for fragmentation - when you copy everything to the new space, you can remove any fragmentation. At the cost of double the space requirements though. So if memory size is optimized (application-specific!) no object copying will ever happen, although it is server-load-specific too (application-usage-specific; what do they do most frequently?) - just statistics; you need to monitor the JVM and make a decision. A few years ago I had a hard time explaining to a client that a byte array should be Base64 encoded instead of just <byte>123</byte>... instead of GC tuning... SOLR uses XML; try to upload a big XML file - each Element instance needs at least 100 bytes... try to create an array of 20M Elements (the parser will do!)... so any GC tuning is application-usage-specific too... RAM allocation and GC tuning are usage-specific, not SOLR-specific...
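To illustrate Fuad's Base64 point: encoding a byte[] once produces a single string, instead of one XML element object per byte for the parser to allocate and the GC to chase. A minimal sketch with a hypothetical 3-byte payload (this uses java.util.Base64, which is Java 8+, so newer than the JVMs discussed in this 2009 thread; the idea is the same with any Base64 codec):

```java
import java.util.Base64;

public class B64 {
    // One encoded String instead of N <byte>...</byte> elements:
    // far fewer objects on the heap, far less GC pressure.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[]{1, 2, 3})); // AQID
    }
}
```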
Re: Solr and Garbage Collection
Jonathan Ariel wrote: How can I check which GC is being used? If I'm right, JVM Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have any recommendation on this? Just to straighten out this one too - Ergonomics doesn't use throughput - throughput is the collector that allows Ergonomics ;) And throughput is the default as long as your machine is detected as server class. But throughput is not great with large tenured spaces out of the box. It only parallelizes the new-space collection. You have to turn on an option to get parallel tenured collection as well - which is essential to scale to large heap sizes. -- - Mark http://www.lucidimagination.com
Re: Solr and Garbage Collection
Mark Miller wrote: Jonathan Ariel wrote: How can I check which GC is being used? If I'm right, JVM Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have any recommendation on this? Just to straighten out this one too - Ergonomics doesn't use throughput - throughput is the collector that allows Ergonomics ;) And throughput is the default as long as your machine is detected as server class. But throughput is not great with large tenured spaces out of the box. It only parallelizes the new-space collection. You have to turn on an option to get parallel tenured collection as well - which is essential to scale to large heap sizes. hmm - I'm not being totally accurate there - ergonomics is what detects a server-class machine and so makes throughput the default collector for it. But much of the GC ergonomics support only works with the throughput collector. Kind of chicken and egg :) -- - Mark http://www.lucidimagination.com
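The option Mark is referring to for parallel tenured (old-generation) collection with HotSpot's throughput collector is -XX:+UseParallelOldGC. A small sketch that checks, from inside the running JVM, whether that flag was actually passed on the command line (a quick way to confirm your startup scripts took effect):

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class CheckFlags {
    public static void main(String[] args) {
        // Input arguments are the JVM options the process was launched with
        // (not the program's own args), so configured GC flags show up here.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("parallel old gen enabled: "
                + jvmArgs.contains("-XX:+UseParallelOldGC"));
    }
}
```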
Re: Solr and Garbage Collection
That's a good point too - if you can reduce your need for such a large heap, by all means, do so. However, considering you already need at least 10GB or you get OOM, you have a long way to go with that approach. Good luck :) How many docs do you have? I'm guessing it's mostly FieldCache-type stuff, and that's the type of thing you can't really sidestep, unless you give up the functionality that's using it. Grant Ingersoll wrote: On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote: Hi to all! Lately my Solr servers seem to stop responding once in a while. I'm using Solr 1.3. Of course I'm having more traffic on the servers. So I logged the Garbage Collection activity to check if it's because of that. It seems like 11% of the time the application runs, it is stopped because of GC. And sometimes the GC takes up to 10 seconds! Is this normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon servers. My index is around 10GB and I'm giving the instances 10GB of RAM. How can I check which GC is being used? If I'm right, JVM Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have any recommendation on this? As I said in Eteve's thread on JVM settings, some extra time spent on application design/debugging will save a whole lot of headache in Garbage Collection and trying to tune the gazillion different options available. Ask yourself: What is on the heap, and does it need to be there? For instance, do you, if you have them, really need sortable ints? If your servers seem to come to a stop, I'm going to bet you have major collections going on. Major collections in a production system are very bad. They tend to happen right after commits in poorly tuned systems, but can also happen in other places if you let things build up due to really large heaps and/or things like really large cache settings. I would pull up jConsole and have a look at what is happening when the pauses occur. Is it a major collection?
If so, then hook up a heap analyzer or a profiler and see what is on the heap around those times. Then have a look at your schema/config, etc. and see if there are things that are memory intensive (sorting, faceting, excessively large filter caches). -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- - Mark http://www.lucidimagination.com
Re: Solr and Garbage Collection
One more point and I'll stop - I've hit my email quota for the day ;) While it's a pain to have to juggle GC params and tune - when you require a heap that's more than a gig or two, I personally believe it's essential to do so for good performance. The (default settings / ergonomics with throughput) just don't cut it. Sad fact of life :) Luckily, you don't generally have to do that much to get things nice - the number of options is not that staggering, and you don't usually need to get into most of them. Choosing the right collector, and tweaking a setting or two, can often be enough. The most important thing to do with a large heap and the throughput collector is to turn on parallel tenured collection. I've said it before, but it really is key. At least if you have more than a processor or two - which, for your sake, I hope you do :) - Mark Mark Miller wrote: That's a good point too - if you can reduce your need for such a large heap, by all means, do so. However, considering you already need at least 10GB or you get OOM, you have a long way to go with that approach. Good luck :) How many docs do you have? I'm guessing it's mostly FieldCache-type stuff, and that's the type of thing you can't really sidestep, unless you give up the functionality that's using it. Grant Ingersoll wrote: On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote: Hi to all! Lately my Solr servers seem to stop responding once in a while. I'm using Solr 1.3. Of course I'm having more traffic on the servers. So I logged the Garbage Collection activity to check if it's because of that. It seems like 11% of the time the application runs, it is stopped because of GC. And sometimes the GC takes up to 10 seconds! Is this normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon servers. My index is around 10GB and I'm giving the instances 10GB of RAM. How can I check which GC is being used? If I'm right, JVM Ergonomics should use the Throughput GC, but I'm not 100% sure.
Do you have any recommendation on this? As I said in Eteve's thread on JVM settings, some extra time spent on application design/debugging will save a whole lot of headache in Garbage Collection and trying to tune the gazillion different options available. Ask yourself: What is on the heap and does it need to be there? For instance, do you, if you have them, really need sortable ints? If your servers seem to come to a stop, I'm going to bet you have major collections going on. Major collections in a production system are very bad. They tend to happen right after commits in poorly tuned systems, but can also happen in other places if you let things build up due to really large heaps and/or things like really large cache settings. I would pull up jConsole and have a look at what is happening when the pauses occur. Is it a major collection? If so, then hook up a heap analyzer or a profiler and see what is on the heap around those times. Then have a look at your schema/config, etc. and see if there are things that are memory intensive (sorting, faceting, excessively large filter caches). -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- - Mark http://www.lucidimagination.com
Re: Solr and Garbage Collection
I have around 8M documents. I set up my server to use a different collector and it seems like it decreased from 11% to 4%; of course I need to wait a bit more because it is just a 1-hour-old log. But it seems like it is much better now. I will tell you the results on Monday :) On Fri, Sep 25, 2009 at 6:07 PM, Mark Miller markrmil...@gmail.com wrote: That's a good point too - if you can reduce your need for such a large heap, by all means, do so. However, considering you already need at least 10GB or you get OOM, you have a long way to go with that approach. Good luck :) How many docs do you have? I'm guessing it's mostly FieldCache-type stuff, and that's the type of thing you can't really sidestep, unless you give up the functionality that's using it. Grant Ingersoll wrote: On Sep 25, 2009, at 9:30 AM, Jonathan Ariel wrote: Hi to all! Lately my Solr servers seem to stop responding once in a while. I'm using Solr 1.3. Of course I'm having more traffic on the servers. So I logged the Garbage Collection activity to check if it's because of that. It seems like 11% of the time the application runs, it is stopped because of GC. And sometimes the GC takes up to 10 seconds! Is this normal? My instances run on 16GB RAM, Dual Quad Core Intel Xeon servers. My index is around 10GB and I'm giving the instances 10GB of RAM. How can I check which GC is being used? If I'm right, JVM Ergonomics should use the Throughput GC, but I'm not 100% sure. Do you have any recommendation on this? As I said in Eteve's thread on JVM settings, some extra time spent on application design/debugging will save a whole lot of headache in Garbage Collection and trying to tune the gazillion different options available. Ask yourself: What is on the heap, and does it need to be there? For instance, do you, if you have them, really need sortable ints? If your servers seem to come to a stop, I'm going to bet you have major collections going on.
Major collections in a production system are very bad. They tend to happen right after commits in poorly tuned systems, but can also happen in other places if you let things build up due to really large heaps and/or things like really large cache settings. I would pull up jConsole and have a look at what is happening when the pauses occur. Is it a major collection? If so, then hook up a heap analyzer or a profiler and see what is on the heap around those times. Then have a look at your schema/config, etc. and see if there are things that are memory intensive (sorting, faceting, excessively large filter caches). -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- - Mark http://www.lucidimagination.com
RE: Solr and Garbage Collection
Sorry for OFF-topic: Create a dummy "Hello, World!" JSP, use Tomcat, execute load-stress simulator(s) from separate machine(s), and measure... don't forget to allocate the necessary thread pools in Tomcat (if you have to)... Although such a JSP doesn't use any memory, you will see how easily one can reach 5000 TPS (or 'virtually' 5 concurrent users) on modern quad-cores by simply allocating more memory (...GB) and more Tomcat threads. There is a threshold too... repeat it with HTTPD Workers (and threads), same result, although it doesn't use any GC. More memory - more threads - more keep-alives per TCP... However, 'theoretically' you need only 64Mb for Hello World :)))
Re: problem with HTMLStripStandardTokenizerFactory
Can you give a small test file that demonstrates the problem? -Yonik http://www.lucidimagination.com On Fri, Sep 25, 2009 at 5:34 AM, Kundig, Andreas andreas.kun...@wipo.int wrote: Hello I can't bring HTMLStripStandardTokenizerFactory to remove the content of the style tag, as the documentation says it should. A search for 'mso' returns a document where the search term only appears in the style tag (it's a Word document saved as HTML). Here is the highlight returned by Solr (by the way: the wrong word is highlighted):

vetica;#13;\n\tpanose-1:2 11 5 4 2 2 2 2 2 4;<em>#13</em>;\n\tmso-font-charset:0;<em>#13</em>;\n\tmso-generic-font-family:swiss;<em>#13</em>

I am using Solr 1.3. Here is how I configured the tokenizer in schema.xml:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Am I doing something wrong?
thank you Andréas Kündig World Intellectual Property Organization
Problem changing the default MergePolicy/Scheduler
Hello, It looks like Solr is not allowing me to change the default MergePolicy/Scheduler classes. Even if I change the default MergePolicy/Scheduler (LogByteSizeMergePolicy and ConcurrentMergeScheduler) defined in solrconfig.xml to different ones (LogDocMergePolicy and SerialMergeScheduler), my profiler shows the default classes are still being loaded. Also, if I use the default LogByteSizeMergePolicy, I can't seem to override 'calibrateSizeByDeletes' to 'true' in solrconfig using the new syntax that was introduced this week (SOLR-1447). I'm using the version checked out from trunk yesterday. Any pointers will be helpful. Thanks, -Jibo