Re: Is there any stress test tool for testing Solr?
Thanks to both Gora & Amit. A little information for people following this discussion: I found there's a SolrMeter open source project on Google Code - http://code.google.com/p/solrmeter/ - built specifically for load testing Solr. I'll evaluate the following tools and pick one for my testing:

WebStress
Apache Bench
JMeter
SolrMeter

Oh, and I'll correct a piece of wrong information in my post: we're building a 12-million newspaper index, rather than 1.2 million.

Scott

----- Original Message -----
From: "Gora Mohanty"
To:
Sent: Friday, August 27, 2010 2:22 AM
Subject: Re: Is there any stress test tool for testing Solr?

On Wed, 25 Aug 2010 19:58:36 -0700 Amit Nithian wrote:
> I recommend JMeter. We use that to do load testing on a search server.
[...]

JMeter is certainly good, but we have also found Apache bench to be of much use. Maybe it is just us, and what we are familiar with, but Apache bench seemed easier to automate. It was also much easier to get up and running with, at least IMHO.

> Be careful though... as silly as this may sound... do NOT just issue random queries, because that won't exercise your caches...
[...]

Conversely, we are still trying to figure out how to make real-life measurements without the Solr cache coming into the picture. For querying on a known keyword, every hit after the first, with Apache bench, is strongly affected by the Solr cache. We tried using random strings, but at least with Apache bench, the query string is fixed for each invocation. We have to investigate whether one can do otherwise with JMeter plugins. Also, a query that returns no result (as a random query string typically would) seems to be significantly faster than a real query. So I think that in the long run, the best way is to build information about *typical* queries that your users run, using the Solr logs, and then use a set of such queries for benchmarking.

Regards,
Gora
Replication in 1.4: "Replicate Now" works, but scheduled replication does not
Hello,

I am upgrading from 1.3 to 1.4 and setting up the new replication method. On master I added this section:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

On slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://localhost:8085/solr/replication</str>
      <str name="pollInterval">00:15:00</str>
    </lst>
  </requestHandler>

They are on the same server on different ports; this is a QA environment. Master is on 8085, slave on 8086. Master is started with -Denable.master=true -Denable.slave=false, and slave with -Denable.master=false -Denable.slave=true.

The master admin page has a Replication link, and the master responds to the details command fine. The slave does not have a Replication link on its admin page, but when I go there directly and click "Replicate Now", it does work. It fails, though, when it tries to run on schedule (every 15 minutes). Also, the slave replication admin page does not show the status of the master.

This is the detailed info from the slave. If you check the schedule, I have only one successful replication, which I ran manually at Sun Aug 29 17:52:17 EDT 2010; the others run on the 15-minute schedule and all fail.

Key values from http://devslave:8086/solr/replication?command=details (status 0, QTime 11):

  slave index: 504.02 MB at /usr/local/solr/solr_home_qa_slave/data/index
  isMaster: false, isSlave: true, indexVersion: 1283100903442, generation: 2
  master index: 504.02 MB at /usr/local/solr/solr_home_qa/data/index
  master file list: _0.nrm _0.tis _0.fnm _0.tii _0.frq segments_2 _0.fdx _0.prx _0.fdt
  masterUrl: http://SOLRMASTERDEV:8085/solr/replication
  pollInterval: 00:15:00, next execution at: Sun Aug 29 18:30:00 EDT 2010
  indexReplicatedAtList: Sun Aug 29 18:30:00, 17:52:17, 17:45:00, 17:30:00, 17:15:00, 17:00:00, 16:51:26, 16:45:00, 16:41:34, 16:30:00 (all EDT)
  replicationFailedAtList: Sun Aug 29 18:30:00, 17:45:00, 17:30:00, 17:15:00, 17:00:00, 16:51:26, 16:45:00, 16:41:34, 16:30:00, 16:15:00 (all EDT)
  timesIndexReplicated: 13, timesFailed: 12
  isPollingDisabled: false, isReplicating: false

(This response format is experimental. It is likely to change in the future.)
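One more data point: as far as I understand it, the -Denable.master/-Denable.slave flags only do anything if the replication handler sections are guarded with them in solrconfig.xml. This is a sketch of the pattern from the SolrReplication wiki; I still need to verify that my own config matches it:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <!-- section only takes effect when started with -Denable.master=true -->
      <str name="enable">${enable.master:false}</str>
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
    <lst name="slave">
      <!-- section only takes effect when started with -Denable.slave=true -->
      <str name="enable">${enable.slave:false}</str>
      <str name="masterUrl">http://localhost:8085/solr/replication</str>
      <str name="pollInterval">00:15:00</str>
    </lst>
  </requestHandler>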
Re: Updating document without removing fields
No. Document creation is all-or-nothing; fields are not updateable. I think you have to filter all of your field changes through a "join" server. That is, all field updates could go to a database, and the master would read document updates from that database. Or, you could have one updater feed its updates to the other, and that one would send all updates to the master.

Lance

On Sun, Aug 29, 2010 at 6:19 PM, Max Lynch wrote:
> Hi,
> I have a master solr server and two slaves. On each of the slaves I have
> programs running that read the slave index, do some processing on each
> document, add a few new fields, and commit the changes back to the master.
>
> The problem I'm running into right now is one slave will update one document
> and the other slave will eventually update the same document, but the
> changes will overwrite each other. For example, one slave will add a field
> and commit the document, but the other slave won't have that field yet so it
> won't duplicate the document when it updates the doc with its own new
> field. This causes the document to miss one set of fields from one of the
> slaves.
>
> Can I update a document without having to recreate it? Is there a way to
> update the slave and then have the slave commit the changes to the master
> (adding new fields in the process?)
>
> Thanks.

--
Lance Norskog
goks...@gmail.com
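To make "all-or-nothing" concrete: when you re-add a document you have to send every field again, old and new, in a single update message. A sketch, with hypothetical field names:

  <add>
    <doc>
      <field name="id">doc-1</field>
      <!-- every existing field must be resent... -->
      <field name="title">Original title</field>
      <!-- ...along with whatever new fields your processing added -->
      <field name="slave_a_score">0.87</field>
      <field name="slave_b_category">politics</field>
    </doc>
  </add>

Any field left out of that message is gone after the re-add, which is exactly the overwriting you are seeing when the two slaves update the same document independently.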
Updating document without removing fields
Hi,

I have a master solr server and two slaves. On each of the slaves I have programs running that read the slave index, do some processing on each document, add a few new fields, and commit the changes back to the master.

The problem I'm running into right now is one slave will update one document and the other slave will eventually update the same document, but the changes will overwrite each other. For example, one slave will add a field and commit the document, but the other slave won't have that field yet so it won't duplicate the document when it updates the doc with its own new field. This causes the document to miss one set of fields from one of the slaves.

Can I update a document without having to recreate it? Is there a way to update the slave and then have the slave commit the changes to the master (adding new fields in the process?)

Thanks.
Re: Multiple passes with WordDelimiterFilterFactory
There's nothing built into SOLR that I know of that'll deal with auto-detecting multiple languages and "doing the right thing". I know there's been discussion of that; searching the users' list might help... You may have to write your own analyzer that tries to do this, but I have no clue how you'd go about it.

<<<It seems that charFilters are applied even before the tokenizer>>>

Try putting this after any instances of, say, WhiteSpaceTokenizerFactory in your analyzer definition, and I believe you'll see that this is not true. At least, looking at this in the analysis page from SOLR admin sure doesn't seem to support that assertion.

This last doesn't help much with the different character sets though... I'll have to leave any other insights to wiser heads than mine.

Best
Erick

On Sun, Aug 29, 2010 at 12:44 PM, Shawn Heisey wrote:
> Thank you for taking the time to help. The way I've got the word
> delimiter index filter set up with only one pass, "wolf-biederman" will
> result in wolf, biederman, wolfbiederman, and wolf-biederman. With two
> passes, the last one is not present. One pass changes "gremlin's" to
> gremlin and gremlin's. Two passes results in gremlin and gremlins.
>
> I was trying to use the PatternReplaceCharFilterFactory to strip leading
> and trailing punctuation, but it didn't work. It seems that charFilters are
> applied even before the tokenizer, which will not produce the results I
> want, and the filter I'd come up with was eating everything, producing no
> results. I later realized that it would not work with radically different
> character sets like Arabic and Cyrillic, even if I solved those problems.
> Is there a regular filter that could strip leading/trailing punctuation?
>
> As for stemming, we have no effective way to separate the languages. Most
> of the content is English, but we also have Spanish, Arabic, Russian,
> German, French, and possibly a few others. For that reason, I'm not using
> stemming. I've been thinking that I might want to use an English stemmer
> anyway to improve results on most of the content, but I haven't done any
> testing yet.
>
> Thanks,
> Shawn
>
> On 8/29/2010 12:28 PM, Erick Erickson wrote:
>> Look at the tokenizer/filter chain that makes up your analyzers, and see:
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> for other tokenizer/analyzer/filter options.
>>
>> You're on the right track looking at the various choices provided, and
>> I suspect you'll find what you need...
>>
>> Be a little cautious about preserving things. Your users will often be more
>> confused than helped if you require hyphens for a match. Ditto with
>> possessives, plurals, etc. You might want to look at stemmers
Re: Search Results optimization
Also, my request handler is a dismax handler with qf set to name^2.4 and tie set to 0.1.

I really need some help on this. Again, what I want is: if I search for "swingline red stapler", docs that have all three keywords should come on top of the results, then docs that have any 2 of the keywords, and then docs with 1 keyword - I mean, in that sorted order.

thanks
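From what I have read, the dismax mm (minimum should match) parameter may be relevant here; this is a sketch of what I might add to the handler defaults (the value is just my guess):

  <!-- require at least one of the query terms to match -->
  <str name="mm">1</str>

With mm=1, a single keyword is enough to match, and docs matching more of the three keywords should naturally score higher - but I have not yet confirmed this gives exactly the ordering I want.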
Re: Multiple passes with WordDelimiterFilterFactory
Thank you for taking the time to help. The way I've got the word delimiter index filter set up with only one pass, "wolf-biederman" will result in wolf, biederman, wolfbiederman, and wolf-biederman. With two passes, the last one is not present. One pass changes "gremlin's" to gremlin and gremlin's. Two passes results in gremlin and gremlins.

I was trying to use the PatternReplaceCharFilterFactory to strip leading and trailing punctuation, but it didn't work. It seems that charFilters are applied even before the tokenizer, which will not produce the results I want, and the filter I'd come up with was eating everything, producing no results. I later realized that it would not work with radically different character sets like Arabic and Cyrillic, even if I solved those problems. Is there a regular filter that could strip leading/trailing punctuation?

As for stemming, we have no effective way to separate the languages. Most of the content is English, but we also have Spanish, Arabic, Russian, German, French, and possibly a few others. For that reason, I'm not using stemming. I've been thinking that I might want to use an English stemmer anyway to improve results on most of the content, but I haven't done any testing yet.

Thanks,
Shawn

On 8/29/2010 12:28 PM, Erick Erickson wrote:
> Look at the tokenizer/filter chain that makes up your analyzers, and see:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> for other tokenizer/analyzer/filter options.
>
> You're on the right track looking at the various choices provided, and
> I suspect you'll find what you need...
>
> Be a little cautious about preserving things. Your users will often be more
> confused than helped if you require hyphens for a match. Ditto with
> possessives, plurals, etc. You might want to look at stemmers
Re: ExternalFileField best practices
The extended dismax parser (see SOLR-1553) may do what you are looking for. From its feature list: 'Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of adding it in.'

On Sun, Aug 29, 2010 at 12:27 AM, Andy wrote:
> But isn't it the case that bf adds the boost value while {!boost} multiplies
> the boost value? In my case I think a multiplication is more appropriate.
>
> So there's no way to use ExternalFileField in {!boost}?
>
> --- On Sat, 8/28/10, Lance Norskog wrote:
>
> > From: Lance Norskog
> > Subject: Re: ExternalFileField best practices
> > To: solr-user@lucene.apache.org
> > Date: Saturday, August 28, 2010, 11:55 PM
> > You want the boost function bf=
> > parameter.
> >
> > On Sat, Aug 28, 2010 at 5:32 PM, Andy wrote:
> > > Lance,
> > >
> > > Thanks for the response.
> > >
> > > Can I use an ExternalFileField as an input to a boost query?
> > >
> > > For example, if I put the field "popularity" in an ExternalFileField,
> > > can I still use "popularity" in a boosted query such as:
> > >
> > > {!boost b=log(popularity)}foo
> > >
> > > The doc says ExternalFileField can only be used in FunctionQuery.
> > > Does that include a boost query like {!boost b=log(popularity)}?
> > >
> > > --- On Sat, 8/28/10, Lance Norskog wrote:
> > >
> > >> From: Lance Norskog
> > >> Subject: Re: ExternalFileField best practices
> > >> To: solr-user@lucene.apache.org
> > >> Date: Saturday, August 28, 2010, 5:16 PM
> > >> The file is completely reloaded when you commit or optimize. There is
> > >> no incremental update available. And, yes, this could be a scaling
> > >> problem.
> > >>
> > >> How you update it is completely external to Solr.
> > >>
> > >> On Sat, Aug 28, 2010 at 2:50 AM, Andy wrote:
> > >> > I'm interested in using ExternalFileField to store a field
> > >> > "popularity" that is being updated frequently.
> > >> >
> > >> > However ExternalFileField seems to be a pretty obscure feature.
> > >> > I have a few questions:
> > >> >
> > >> > 1) Can anyone share your experience using it?
> > >> >
> > >> > 2) What is the most efficient way to update the external file?
> > >> > For example, the file could look like:
> > >> >
> > >> > 1=12   // the document with uniqueKey 1 has a popularity of 12
> > >> > 2=4
> > >> > 3=45
> > >> > 5=78
> > >> >
> > >> > Now the popularity of document 1 is updated to 13:
> > >> >
> > >> > - What is the best way to update the file to reflect the change?
> > >> >   Isn't this an O(n) operation?
> > >> > - How to deal with concurrent updates to the file by multiple threads?
> > >> >
> > >> > Would this method of using an external file scale?
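For reference, the schema side of this looks roughly like the example schema's declaration (a sketch; the field and key names are taken from Andy's description, and in 1.x the valType needs to be a float type):

  <fieldType name="externalPopularity" class="solr.ExternalFileField"
             keyField="id" defVal="0" stored="false" indexed="false"
             valType="pfloat"/>
  <field name="popularity" type="externalPopularity"/>

The values themselves live in a file named external_popularity (or external_popularity.*) in the index data directory, one key=value pair per line as shown above, and get reloaded when a new searcher is opened.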
Re: Multiple passes with WordDelimiterFilterFactory
Look at the tokenizer/filter chain that makes up your analyzers, and see:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for other tokenizer/analyzer/filter options.

You're on the right track looking at the various choices provided, and I suspect you'll find what you need...

Be a little cautious about preserving things. Your users will often be more confused than helped if you require hyphens for a match. Ditto with possessives, plurals, etc. You might want to look at stemmers

Best
Erick

On Sat, Aug 28, 2010 at 6:20 PM, Shawn Heisey wrote:
> It's metadata for a collection of 45 million documents that is mostly
> photos, with some videos and text. The data is imported from a MySQL
> database and split among six large shards (each nearly 13GB) and a small
> shard with data added in the last week. That works out to between 300,000
> and 500,000 documents.
>
> I am mostly trying to think of ways to drastically reduce the index size
> without reducing the functionality. Using copyField would just make it
> larger.
>
> I would like to make it so that I don't have two terms when there's a
> punctuation character at the beginning or end of a word. For instance, one
> field value that I just analyzed ends up with terms like the following,
> which are unneeded duplicates:
>
> championship.
> championship
> '04
> 04
> wisconsin.
> wisconsin
>
> Since I was already toying around, I just tested the whole notion. I ran
> it through once with just generateWordParts and catenateWords enabled, then
> again with all the options including preserveOriginal enabled. A test
> analysis of input with 59 whitespace separated words showed 93 terms with
> the single filter and 77 with two. The only drop in term quality that I
> noticed was that possessive words (apostrophe-s) no longer have the original
> preserved. I haven't yet decided whether that's a problem.
>
> Shawn
>
> On 8/27/2010 11:00 AM, Erick Erickson wrote:
>> I agree with Marcus, the usefulness of passing through WDF twice
>> is suspect. You can always do a copyfield to a completely different
>> field and do whatever you want there, copyfield forks the raw input
>> to the second field, not the analyzed stream...
>>
>> What is it you're really trying to accomplish? Your use-case would
>> help us help you.
>>
>> About defining things differently in index and analysis. Sure, it can
>> make sense. But, especially with WDF it's tricky. Spend some
>> significant time in the admin analysis page looking at the effects
>> of various configurations before you decide.
>>
>> Best
>> Erick
Re: Multiple passes with WordDelimiterFilterFactory
On 8/28/2010 7:59 PM, Shawn Heisey wrote:
> The only drop in term quality that I noticed was that possessive words
> (apostrophe-s) no longer have the original preserved. I haven't yet
> decided whether that's a problem.

I finally did notice another drop in term quality from the dual pass - words with punctuation in the middle (like wolf-biederman) are not preserved with that punctuation intact. I need a different filter to strip non-alphanumerics from the beginning and end of terms, one that runs after the tokenizer and the ASCII folding filter but before the word delimiter filter. Does such a thing already exist, or do I just need to use something that does regex? Are there any recommended regex patterns out there for this?

Thanks,
Shawn
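In case it helps frame the question, this is the kind of thing I was planning to try - a PatternReplaceFilterFactory after the tokenizer, with a pattern that is only my untested guess at stripping leading and trailing punctuation:

  <!-- strip punctuation from the start and end of each token -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+|\p{Punct}+$"
          replacement=""
          replace="all"/>

Since this is a token filter rather than a charFilter, it operates on tokens after the tokenizer, so it could sit between the ASCII folding filter and the word delimiter filter in the chain.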
anybody using solr with Cassandra?
Hi,

Is anybody using Solr with Cassandra? Are there any gotchas?

Thanks
--Siju
Re: Multiple passes with WordDelimiterFilterFactory
It's metadata for a collection of 45 million documents that is mostly photos, with some videos and text. The data is imported from a MySQL database and split among six large shards (each nearly 13GB) and a small shard with data added in the last week. That works out to between 300,000 and 500,000 documents.

I am mostly trying to think of ways to drastically reduce the index size without reducing the functionality. Using copyField would just make it larger.

I would like to make it so that I don't have two terms when there's a punctuation character at the beginning or end of a word. For instance, one field value that I just analyzed ends up with terms like the following, which are unneeded duplicates:

championship.
championship
'04
04
wisconsin.
wisconsin

Since I was already toying around, I just tested the whole notion. I ran it through once with just generateWordParts and catenateWords enabled, then again with all the options including preserveOriginal enabled. A test analysis of input with 59 whitespace separated words showed 93 terms with the single filter and 77 with two. The only drop in term quality that I noticed was that possessive words (apostrophe-s) no longer have the original preserved. I haven't yet decided whether that's a problem.

Shawn

On 8/27/2010 11:00 AM, Erick Erickson wrote:
> I agree with Marcus, the usefulness of passing through WDF twice
> is suspect. You can always do a copyfield to a completely different
> field and do whatever you want there, copyfield forks the raw input
> to the second field, not the analyzed stream...
>
> What is it you're really trying to accomplish? Your use-case would
> help us help you.
>
> About defining things differently in index and analysis. Sure, it can
> make sense. But, especially with WDF it's tricky. Spend some
> significant time in the admin analysis page looking at the effects
> of various configurations before you decide.
>
> Best
> Erick
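For anyone curious, the two passes I compared look roughly like this (a sketch from memory, not pasted from my schema; unlisted attributes are left at their defaults):

  <!-- pass 1: just split on delimiters and catenate word parts -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1"/>

  <!-- pass 2: all the options on, including preserveOriginal -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="1"
          splitOnCaseChange="1" preserveOriginal="1"/>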