RE: Is Solr right for my business situation?
Thanks for the responses, people.

@Grant
1. Can you show me some direction on that -- loading data from an incoming stream? Do I need some third-party tools, or do I need to build something myself?
4. I am basically attempting to build a very fast search interface for the existing data. The volume I mentioned is more of a static one (the data is already there). The SQL statements I mentioned are daily updates coming in. The good thing is that the history is not kept, so the overall volume is not growing, but I need to apply the update statements. One workaround I had in mind (though not so great for performance) is to apply the updates to a copy of the RDBMS, and then feed the RDBMS extract to Solr. Sounds like overkill, but I don't have another idea right now. Perhaps business discussions would yield something.

@All - Some more questions, guys.
1. I have about 3-5 tables. Designing schema.xml for a single table looks OK, but what's the direction for handling multiple table structures? Would it be one big XML, wherein those three tables (assuming it's three) would show up as three different tag-trees, nullable? My source provides me a single flat file per table (tab-delimited).
2. Further, loading into Solr could use some performance tuning. Any tips? Best practices?
3. Also, is there a way to specify an XSLT on the server side and make it the default, i.e., whenever a response is returned, that XSLT is applied to the response automatically?
4. And the last question for the day :) -- there was one post saying that the spatial support is really basic in Solr and is going to be improved in the next versions. Can you people help me get a definitive yes or no on spatial support? In its current form, does it work or not? I would store lat and long, and would need to make them searchable.

Looks like I'm close to my solution. :)
--raghav

-----Original Message-----
From: Grant Ingersoll [mailto:gsing...@apache.org]
Sent: Tuesday, September 28, 2010 1:05 AM
To: solr-user@lucene.apache.org
Subject: Re: Is Solr right for my business situation?

Inline.

On Sep 27, 2010, at 1:26 PM, Walter Underwood wrote:

When do you need to deploy? As I understand it, the spatial search in Solr is being rewritten and is slated for Solr 4.0, the release after next.

It will be in 3.x, the next release.

The existing spatial search has some serious problems and is deprecated. Right now, I think the only way to get spatial search in Solr is to deploy a nightly snapshot from the active development on trunk. If you are deploying a year from now, that might change. There is no support for SQL-like statements or for joins. The best practice for Solr is to think of your data as a single table, essentially creating a view from your database. The rows become Solr documents, the columns become Solr fields.

There are now group-by capabilities in trunk as well, which may or may not help.

wunder

On Sep 27, 2010, at 9:34 AM, Sharma, Raghvendra wrote:

I am sure these kinds of questions keep coming to you guys, but I want to raise the same question in a different context... my own business situation. I am very, very new to Solr, and though I have tried to read through the documentation, I am nowhere near completing the whole read. The need is like this - we have a huge RDBMS database/table. A single table perhaps houses 100+ million rows. Though Oracle is doing a fine job of handling the insertion and updating of data, the querying is where our main concerns lie.
Since we have spatial data, the index building takes hours and hours for such tables. That's when we thought of moving away from a standard RDBMS and trying something different and fast. My last week has been spent on a journey reading through Bigtable to Hadoop to HBase, to Hive, and then I finally landed on Solr. As far as I am in my tests, it looks pretty good, but I still have a few unanswered questions. Trying this group for them :) (I am sure I could find some answers if I read/googled more on the topic, but now I'm being lazy and feel that asking the people who are already using it, or perhaps developing it, is a better bet.)

1. Can I get my Solr instance to load data (fresh data for indexing) from a stream (imagine an MQ kind of queue, or similar)?

Yes, with a little bit of work.

2. Can I host my Solr instance to use HBase as the database/file system (read HDFS)?

Probably, but I doubt it will be fast. Local disk is usually the best. 100+ M rows is large but not unreasonable.

3. Are there any reports available anywhere (as in benchmarks) for a Solr instance's performance?

You can probably search the web for these. I've personally seen several installs w/ 1B+ docs and subsecond search and faceting, and heard of others. You might look at the stuff the HathiTrust has put up.

4. are there any APIs available which might help me
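For reference on question 1, the queue consumer itself has to be custom-built, but each message it pulls off the stream can be turned into Solr's XML update format and POSTed to the /update handler. A minimal sketch (the field names and values are illustrative only):

<add>
  <doc>
    <field name="id">12345</field>
    <field name="title">Example document taken off the queue</field>
  </doc>
</add>

followed periodically by a <commit/> message so the newly added documents become searchable.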
RE: Is Solr right for my business situation?
Staging the data in a non-Solr store sounds like a potentially reasonable idea to me. You might want to consider a NoSQL store of some kind, like MongoDB perhaps, instead of an RDBMS.

The way to think about Solr is not as a store or a database -- it's an index for serving your application. That's also the way to think about how to get your multiple tables in there -- denormalize, denormalize, denormalize. You need to think about what you actually need to search over, and build your index to serve that efficiently, rather than thinking about normalization or data modelling the way we are used to with RDBMSs. It's a different way of thinking.

A Solr index basically gives you one collection of documents. But the documents can all have different fields -- so you _could_ (but probably don't want to) essentially put all your tables in there with unique fields. They're all in the same index, they're all just documents, but some have a table1_title and table1_author, and others have no data in those fields but a table2_productName and a table2_price. Then if you want to query on just one type of thing, you just query on those fields. Except... you don't get any joins. Which is why you probably don't want to do that after all; it probably won't serve your needs.

Figuring out the right way to model your data in Solr can be tricky, and it is sometimes hard to do exactly what you want. Solr isn't an RDBMS, and in some ways isn't as powerful as an RDBMS -- in the sense of being as flexible with what kinds of queries you can run on any given data. What it does is give you very fast access to inverted index lookups and set combinations and faceting that would be very hard to do efficiently in an RDBMS. It is a trade-off. But there's not really a general answer to "how do I take these dozen RDBMS tables and store them in Solr the best way?" -- it depends on what kinds of searching you need to support and the nature of your data.
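As a sketch of the table1/table2 example above, the per-table fields might be declared in schema.xml like this (the types chosen here are illustrative):

<field name="table1_title" type="text" indexed="true" stored="true"/>
<field name="table1_author" type="text" indexed="true" stored="true"/>
<field name="table2_productName" type="text" indexed="true" stored="true"/>
<field name="table2_price" type="float" indexed="true" stored="true"/>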
Re: Re: The search response time is too loong
I guess you are correct. We used the default Solr cache configuration. I will change the cache configuration. BTW, I want to deploy several shards from the existing 8G index file, such as 4G per shard. Is there any tool to generate two shards from one 8G index file?

From: kenf_nc ken.fos...@realestate.com
Reply-To: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Re: The search response time is too loong
Date: Mon, 27 Sep 2010 05:37:25 -0700 (PDT)

Mem usage is over 400M -- do you mean Tomcat mem size? If you don't give your caches enough room to grow, you will choke the performance. You should adjust your Tomcat settings to let the cache grow to at least 1GB, or better, 2GB. You may also want to look into warming the cache (http://wiki.apache.org/solr/SolrCaching) to make the first-time call a little faster. For comparison, I also have about 8GB in my index but only 2.8 million documents. My search query times on a smaller box than you specify are 6533 milliseconds on an unwarmed (newly rebooted) instance.
What's the difference between TokenizerFactory, Tokenizer, Analyzer?
Could someone help me to understand the differences between TokenizerFactory, Tokenizer, and Analyzer? Specifically, I'm interested in implementing auto-complete for tags that could contain both English and Chinese. I read this article: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/. In the article, KeywordTokenizerFactory is used as the tokenizer. I thought I'd try replacing that with CJKTokenizer. Two questions:

1) KeywordTokenizerFactory seems to be a tokenizer factory, while CJKTokenizer seems to be just a tokenizer. Are they the same type of thing at all? Could I just replace <tokenizer class="solr.KeywordTokenizerFactory"/> with <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>?

2) I'm also interested in trying out SmartChineseAnalyzer (http://lucene.apache.org/java/2_9_0/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html). However, SmartChineseAnalyzer doesn't offer a separate tokenizer. It's just an analyzer, and that's it. How do I use it in Solr?

Thanks.
Andy
Re: Is Solr right for our project?
Interesting. So what you are saying, though, is that at the moment it is NOT there?

On Mon, Sep 27, 2010 at 9:06 PM, Jan Høydahl / Cominvent jan@cominvent.com wrote:

Solr will match this in version 3.1, which is the next major release. Read this page for feature descriptions: http://wiki.apache.org/solr/SolrCloud. Coming to a trunk near you - see https://issues.apache.org/jira/browse/SOLR-1873

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 27. sep. 2010, at 17.44, Mike Thomsen wrote:

(I apologize in advance if I missed something in your documentation, but I've read through the wiki on the subject of distributed searches and didn't find anything conclusive.)

We are currently evaluating Solr and Autonomy. Solr is attractive due to its open-source background, following, and price. Autonomy is expensive, but we know for a fact that it can handle our distributed search requirements perfectly. What we need to know is whether Solr has capabilities that match or roughly approximate Autonomy's Distributed Search Handler. What it does is act as a front-end for all of Autonomy's IDOL search servers (which correspond in this scenario to Solr shards). It is configured to know what is on each shard and which servers hold each shard, and it intelligently farms out queries based on that configuration. There is no need to specify which IDOL servers to hit while querying; the DiSH just knows where to go. Additionally, I believe in cases where an index piece is mirrored, it also monitors server health and falls back intelligently on other backup instances of a shard/index piece based on that.

I'd appreciate it if someone could give me a frank explanation of where Solr stands in this area.

Thanks,
Mike
Limitations of prohibited clauses in sub-expressions - pure negative query
I can't find the answer, but is this problem solved in Solr 1.4.1? Thx for your answers.
Is a multi-threaded searcher a feasible idea to speed up search?
Hi all,

I want to speed up search time for my application. In a query, the time is largely spent reading posting lists (I/O with the .frq files) and calculating scores and collecting results (CPU, with a priority queue). The I/O is hard to optimize, or is already partly optimized by NIO. So I want to use multiple threads to utilize the CPU. Of course, it may decrease QPS, but the response time will also decrease -- that's what I want, because CPU is easily obtained compared to a faster hard disk.

I read the search code roughly and found it's not an easy task to modify the search process. So I want to use another, easier method. One is to use Solr distributed search and dispatch documents to many shards, but due to the network overhead and the global idf problem, it seems not a good method for me. Another one is to modify the index structure and evenly dispatch the .frq files. E.g., for term1 -> doc1, doc2, doc3, doc4, doc5 in _1.frq, I create 2 indexes with term1 -> doc1, doc3, doc5 and term1 -> doc2, doc4. When searching, I create 2 threads with 2 priority queues to collect the top N docs and then merge their results.

Is the 2nd idea feasible? Or does anyone have a related idea? Thanks.
Re: Is a multi-threaded searcher a feasible idea to speed up search?
This is an excellent idea! And desperately needed. It's high time Lucene could take advantage of concurrency when running a single query. Machines have tons of cores these days! (My dev box has 24!)

Note that one simple way to do this is to use ParallelMultiSearcher: it uses one thread per segment in your index. But note that [perversely] this means if your index is optimized you get no concurrency gain! So you have to create your index w/ a carefully picked maxMergeDocs/MB to ensure you can use concurrency.

I don't like having concurrency tied to index structure. So a better approach would be to have each thread pull its own Scorer for the same query, but then each one does a .advance to its chunk of the index and iterates from there. Then merge the PQs in the end, just like MultiSearcher.

Mike
Re: Is a multi-threaded searcher a feasible idea to speed up search?
Yes, there is a MultiSearcher in Lucene, but the idf across the 2 indexes is not global. Maybe I can modify it, and also the index, like:

term1 df=5 doc1 doc3 doc5
term1 df=5 doc2 doc4
Re: Search Interface
Hi,

You could try to use the Velocity framework to build GUIs in a quick and efficient manner. Solr comes with a Velocity handler already integrated, which could be the best solution in your case: http://wiki.apache.org/solr/VelocityResponseWriter

Also take these hints on the same topic: http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/

There is also a webinar about rapid prototyping with Solr: http://www.slideshare.net/erikhatcher/rapid-prototyping-with-solr-4312681

Hope this helps,
Antonio

On 28/09/2010 4.35, Claudio Devecchi wrote:

Hi everybody, I'm implementing my first Solr engine for conceptual tests. I'm crawling my intranet wiki to make some searches; the engine is working fine already, but I need some interface for my searches. Does somebody know where I can find a sample search interface to customize? Tks
Re: Limitations of prohibited clauses in sub-expressions - pure negative query
Please explain what you want to *do*; your message is so terse it makes it really hard to figure out what you're asking. A couple of example queries would help a lot.

Best,
Erick

On Tue, Sep 28, 2010 at 5:53 AM, Patrick Sauts patrick.via...@gmail.com wrote:

I can't find the answer, but is this problem solved in Solr 1.4.1? Thx for your answers.
RE: Limitations of prohibited clauses in sub-expressions - pure negative query
Maybe the SOLR-80 JIRA issue? As written in the Solr 1.4 book, a pure negative query doesn't work correctly; you have to add 'AND *:*'.

thx

From: Patrick Sauts [mailto:patrick.via...@gmail.com]
Sent: Tuesday, September 28, 2010 11:53
To: 'solr-user@lucene.apache.org'
Subject: Limitations of prohibited clauses in sub-expressions - pure negative query

I can't find the answer, but is this problem solved in Solr 1.4.1? Thx for your answers.
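To illustrate the limitation being referenced: a prohibited clause that stands alone inside a sub-expression, such as (field names are illustrative):

title:car AND (-color:red)

matches nothing, while the rewritten form with the match-all query added behaves as expected:

title:car AND (*:* -color:red)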
Re: Is Solr right for our project?
Yes, in the latest released version (1.4.1) there is a shards= parameter, but the client needs to fill it, i.e., the client needs to know which servers are indexers, searchers, shard masters, and shard replicas. The SolrCloud stuff is still not committed and is only available as a patch right now. However, we encourage you to do a test install based on trunk + SOLR-1873 and give it a try. But we cannot guarantee that the APIs will not change in the released version (hopefully 3.1 sometime this year).

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 28. sep. 2010, at 10.44, Mike Thomsen wrote:

Interesting. So what you are saying, though, is that at the moment it is NOT there?
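For concreteness, the client-filled shards parameter in 1.4.1 is a comma-separated list of host/core locations appended to the query, along these lines (host names here are illustrative):

http://frontend:8983/solr/select?q=foo&shards=shard1.example.com:8983/solr,shard2.example.com:8983/solr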
RE: Need help with spellcheck city name
You might want to look at SOLR-2010. This patch works with the collation feature, having it test the collations it returns to ensure they'll return hits. So if a user types "san jos", it will know that the combination "san jose" is in the index and "san ojos" is not.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: Monday, September 27, 2010 7:45 PM
To: solr-user@lucene.apache.org
Cc: erickerick...@gmail.com
Subject: Re: Need help with spellcheck city name

No, I checked, there is a city called Swan in Iowa. So it is getting it from the city index, and so is Clark. But why does it favor Swan over San? Spellcheck gets weird after I treat the city name as one token. If I do it the old way, it lets San go and corrects Jos as Ojos instead of Jose, because Ojos is ranked #1 and Jose is in the middle. Any more suggestions? Ranking by frequency first, then score, doesn't work either.

From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 5:24:25 PM
Subject: Re: Need help with spellcheck city name

Hmmm, did you rebuild your spelling index after the config changes? And it really looks like somehow you're getting results from a field other than city. Are you also sure that your cityname field is of type autocomplete1? Shooting in the dark here, but these results are so weird that I suspect it's something fundamental.

Best,
Erick

On Mon, Sep 27, 2010 at 8:05 PM, Savannah Beckett savannah_becket...@yahoo.com wrote:

No, it doesn't work; I got a weird result. I set my city name field to be parsed as a single token as follows:

<fieldType name="autocomplete1" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I got the following result for spellcheck:

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="san">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>swan</str>
      </arr>
    </lst>
    <lst name="clar">
      <int name="numFound">1</int>
      <int name="startOffset">4</int>
      <int name="endOffset">8</int>
      <arr name="suggestion">
        <str>clark</str>
      </arr>
    </lst>
  </lst>
</lst>

From: Tom Hill solr-l...@worldware.com
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 3:52:48 PM
Subject: Re: Need help with spellcheck city name

Maybe process the city name as a single token?

On Mon, Sep 27, 2010 at 3:25 PM, Savannah Beckett savannah_becket...@yahoo.com wrote:

Hi, I have city name as a text field, and I want to do spellcheck on it. I use the settings in http://wiki.apache.org/solr/SpellCheckComponent. If I set up city name as a text field and spellcheck "San Jos" (for "San Jose"), I get the suggestion "ojos" for "Jos". I checked the extended results and found that "Jose" is in the middle of all 10 suggestions in terms of score and frequency. I then set city name as a string field and spellchecked again; I got "Van" for "San" and "Ross" for "Jos", which is weird because "San" is correct. How do you set up the spellchecker to spellcheck city names? A city name can have multiple words. Thanks.
Re: What's the difference between TokenizerFactory, Tokenizer, Analyzer?
1) KeywordTokenizerFactory seems to be a tokenizer factory while CJKTokenizer seems to be just a tokenizer. Are they the same type of thing at all? Could I just replace <tokenizer class="solr.KeywordTokenizerFactory"/> with <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>?

You should use org.apache.solr.analysis.CJKTokenizerFactory instead.

2) I'm also interested in trying out SmartChineseAnalyzer (http://lucene.apache.org/java/2_9_0/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html). However, SmartChineseAnalyzer doesn't offer a separate tokenizer. It's just an analyzer, and that's it. How do I use it in Solr?

You can use a Lucene analyzer directly in Solr:

<fieldType name="chineese_text" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>
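For the first question, the factory is referenced the same way as any other tokenizer factory. A minimal sketch of a field type using it (the fieldType name is illustrative):

<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

Here solr.CJKTokenizerFactory is shorthand that Solr resolves to org.apache.solr.analysis.CJKTokenizerFactory.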
Conditional Function Queries
Hi,

Has anyone written any conditional functions yet for use in function queries? I see the use for a function that can run different sub-functions depending on the value of a field. Say you have three documents:

A: title=Sports car, color=red
B: title=Boring car, color=green
C: title=Big car, color=black

Now we have a requirement to boost red cars over green, and green cars over black. The only way I have found to do this today is (ab)using the map() function. DisMax syntax:

q=car&bf=sum(map(query($qr),0,0,0,100.0),map(query($qg),0,0,0,50.0))&qr=color:red&qg=color:green

But I suspect this is expensive in terms of two sub-queries being applied and scored. An elegant way to achieve the same would be through a new native if() or case() function:

q=car&bf=if(color==red; 100; if(color==green; 50; 0))

OR

q=car&bf=case(color, red:100, green:sum(30,20))

What do you think?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Dismax Request handler and Solrconfig.xml
Hi,

I am using Solr 1.4.1 with Nutch to index some of our intranet content. In solrconfig.xml, the default request handler is set to "standard". I am planning to change that to use dismax as the request handler, but when I set default="true" for dismax, Solr does not return any results -- I get results only when I comment out <str name="defType">dismax</str>.

This works:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="qf">title^20.0 pagedescription^15.0</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>

DOES NOT WORK:

<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>

THIS WORKS:

<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- <str name="defType">dismax</str> -->
    <str name="echoParams">explicit</str>

Please let me know what I am doing wrong here.

Sai Thumuluri
Sr. Member - Application Staff
IT Intranet Knowledge Mgmt. Systems
614 560-8041 (Desk)
614 327-7200 (Mobile)
Re: Dismax Request handler and Solrconfig.xml
Are you removing the standard default requestHandler when you do this? Or are you specifying two requestHandlers with default="true"?

-L
RE: Dismax Request handler and Solrconfig.xml
I removed default="true" from the standard request handler.

-----Original Message-----
From: Luke Crouch [mailto:lcro...@geek.net]
Sent: Tuesday, September 28, 2010 12:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Dismax Request handler and Solrconfig.xml

Are you removing the standard default requestHandler when you do this? Or are you specifying two requestHandlers with default="true"?

-L
Re: Conditional Function Queries
On Tue, Sep 28, 2010 at 11:33 AM, Jan Høydahl / Cominvent jan@cominvent.com wrote:

Has anyone written any conditional functions yet for use in function queries?

Nope - but it makes sense and has been on my list of things to do for a long time.

-Y
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
SolrException: Bad Request
Hi,

I'm getting a rather strange exception after a long web server idle period (Tomcat 7.0.2). If I immediately rerun the same request, no errors occur. What may be the problem? All server settings are defaults.

Exception:

my stack trace ...
at sun.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:173)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:89)
at org.apache.cxf.jaxws.JAXWSMethodInvoker.invoke(JAXWSMethodInvoker.java:60)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:75)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at org.apache.cxf.workqueue.SynchronousExecutor.execute(SynchronousExecutor.java:37)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:106)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:243)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:110)
at org.apache.cxf.transport.servlet.ServletDestination.invoke(ServletDestination.java:98)
at org.apache.cxf.transport.servlet.ServletController.invokeDestination(ServletController.java:423)
at org.apache.cxf.transport.servlet.ServletController.invoke(ServletController.java:178)
at org.apache.cxf.transport.servlet.AbstractCXFServlet.invoke(AbstractCXFServlet.java:142)
at org.apache.cxf.transport.servlet.AbstractHTTPServlet.handleRequest(AbstractHTTPServlet.java:179)
at org.apache.cxf.transport.servlet.AbstractHTTPServlet.doPost(AbstractHTTPServlet.java:103)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
at org.apache.cxf.transport.servlet.AbstractHTTPServlet.service(AbstractHTTPServlet.java:159)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:243)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:201)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:163)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:108)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:556)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:401)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:242)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:267)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:245)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:260)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by:
org.apache.solr.client.solrj.SolrServerException: Error executing query
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at com.gramit.services.searching.SearchingService.search(SearchingService.java:186)
... 57 more
Caused by: org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://127.0.0.1/solr/select?q=кофе&fq=lat:[55.16728264288879 TO 56.437558186276114] AND lng:[36.47475305185914 TO 38.735977228049315]&spellcheck=true&spellcheck.count=1&spellcheck.collate=true&spellcheck.q=кофе&start=0&rows=10&sort=dist(2,lat,lng,55.8076049,37.5869184) asc&facet=true&facet.limit=5&facet.mincount=1&facet.field=marketplaceCfg_id&facet.field=productCfg_id&stats=true&stats.field=price&wt=javabin&version=1
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
... 59 more

Thanks.

--
Pavel Minchenkov
Using separate Analyzers for querying and indexing.
Hello,

I am migrating from a pure Lucene application to Solr. For legacy reasons I must support a somewhat obscure query feature: lowercase words in the query should match lowercase or uppercase in the index, while uppercase words in the query should only match uppercase words in the index.

To do this with Lucene, we created a custom Analyzer and a custom TokenFilter. During indexing, the custom TokenFilter duplicates uppercase tokens as lowercase ones and sets their offsets to make them appear in the same position as the uppercase token, i.e., you get two tokens for every uppercase token. Then at query time a normal (case-sensitive) analyzer is used, so that lowercase tokens will match either upper or lower, while uppercase tokens will only match uppercase.

I have looked through the documentation and I see how to specify the Analyzer in the schema.xml file that is used for indexing, but I don't know how to specify that a different Analyzer (the case-sensitive one) should be used for queries. Is this possible?

Thanks,
James
Re: Using separate Analyzers for querying and indexing.
Yeah. You can specify two analyzers in the same fieldType:

<fieldType name="..." class="...">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>

-L
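Filled in for the use case described above, it might look like this; the tokenizer choice and com.example.CaseDuplicatingFilterFactory, a factory wrapping the custom case-duplicating TokenFilter, are hypothetical placeholders:

<fieldType name="text_casepreserving" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- hypothetical factory for the custom filter that emits a lowercase
         copy of each uppercase token at the same position -->
    <filter class="com.example.CaseDuplicatingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>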
Re: Re:The search response time is too loong
Copy the index. Delete half of the documents. Optimize.
Copy the index. Delete the other half of the documents. Optimize.

--
Lance Norskog
goks...@gmail.com
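Assuming the documents can be partitioned on an indexed field such as a numeric id (the field name and ranges below are illustrative), the "delete half" steps can be done with delete-by-query messages POSTed to each copy's /update handler, each followed by <commit/> and <optimize/>:

<delete><query>id:[* TO 4999999]</query></delete>

on the first copy, and

<delete><query>id:[5000000 TO *]</query></delete>

on the second.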
Re: Search Interface
There is already a simple Velocity app. Just hit http://localhost:8983/solr/browse. You can configure some handy parameters in solrconfig.xml to make walkable facets.

--
Lance Norskog
goks...@gmail.com
Best way to check Solr index for completeness
Hello,

What would be the best way to check a Solr index against the original system (database) to make sure the index is up to date? I can use Solr fields like id and timestamp to check against the appropriate fields in the database. Our index currently contains over 2 million documents across several cores. Pulling all documents from the Solr index via search (1000 docs at a time) is very slow. Is there a better way to do it?

Thanks,
Dmitriy
Re: Best way to check Solr index for completeness
Is there a 1:1 ratio of db records to Solr documents? If so, couldn't you simply select the most recently updated record from the db and check to make sure the corresponding Solr doc has the same timestamp?

-L
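A related cheap first-level check is to compare counts rather than documents: a rows=0 query returns numFound without fetching anything, which can be compared against a COUNT(*) on the database side (the core name here is illustrative):

http://localhost:8983/solr/core0/select?q=*:*&rows=0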
Re: Concurrent access to EmbeddedSolrServer
We learned it the hard way; wish I had read this before: http://wiki.apache.org/solr/EmbeddedSolr

It is not thread-safe. You start seeing ConcurrentModificationExceptions within as few as 100 samples when you load it with more than 1 concurrent user (I have tested it using JMeter).

best,
Reuben

On 12/9/2009 12:47 PM, Jon Poulton wrote:

Hi there, I'm about to start implementing some code which will access a Solr instance concurrently via a ThreadPool. I've been looking at the solrj API docs (particularly http://lucene.apache.org/solr/api/index.html?org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html) and I just want to make sure what I have in mind makes sense. The Javadoc is a bit sparse, so I thought I'd ask a couple of questions here.

1) I'm assuming that EmbeddedSolrServer can be accessed concurrently by several threads at once for add, delete and query operations (on the SolrServer parent interface). Is that right? I don't have to enforce single-threaded access?
2) What happens if multiple threads simultaneously call commit?
3) What happens if multiple threads simultaneously call optimize?
4) Both commit and optimize have optional parameters called "waitFlush" and "waitSearcher". These are undocumented in the Javadoc. What do they signify?

Thanks in advance for any help.

Cheers,
Jon
Re: Best way to check Solr index for completeness
That will certainly work for the most recent updates, but I need to compare the entire index.

Dmitriy
Re: is EmbeddedSolrServer thread safe?
No, it is not the same for EmbeddedSolrServer. We learned it the hard way; I guess you would have also learned it by now.

At the SolrJ wiki page (http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer):

"CommonsHttpSolrServer is thread-safe and if you are using the following constructor, you *MUST* re-use the same instance for all requests."

... But is it the same for EmbeddedSolrServer?

Best regards,
Jean-François

--
Reuben Christie
-^-
°v°
/(_)\
^ ^
RE: Dismax Request handler and Solrconfig.xml
Can I please get some help here? I am on a tight timeline to get this done - any ideas/suggestions would be greatly appreciated.

-----Original Message-----
From: Thumuluri, Sai [mailto:sai.thumul...@verizonwireless.com]
Sent: Tuesday, September 28, 2010 12:15 PM
To: solr-user@lucene.apache.org
Subject: Dismax Request handler and Solrconfig.xml
Importance: High
Re: Dismax Request handler and Solrconfig.xml
What you have is exactly what I have on 1.4.0:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>

And it has worked fine. We copied our solrconfig.xml from the examples and changed it for our purposes. You might compare your solrconfig.xml to some of the examples.

-L
Re: Dismax Request handler and Solrconfig.xml
I notice we don't have the default="true"; instead we manually specify qt=dismax in our queries. HTH.

-L
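That is, the handler is selected per request by name via the qt parameter, e.g. (host and query terms are illustrative):

http://localhost:8983/solr/select?qt=dismax&q=intranet+content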
Solr Deduplication and Field Collpasing
All,

I have set up Nutch to submit the crawl results to a Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for those documents that are duplicates. While setting up the dedupe processor in the Solr config file, I have used this 'digest' field in the following way (see below for config details). Since my index has documents other than the ones generated by Nutch, I cannot use overwriteDupes=true, because for non-Nutch-generated documents the digest field will not be populated, and I found that Solr deletes every one of those documents that does not have the digest field populated. Probably because they all will have the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything?

In any case, given the scenario, I thought I would set overwriteDupes=false and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation, I was adding the query string group=true&group.field=sig or group=true&group.field=digest to my overall query in the admin console, and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4.

All this is because Nutch thinks that http://mysite.mydomain.com/index.html and http://mysite/index.html (the URL *is* the unique id for the Nutch document; the difference is only in the alias, and for an internal site both are valid) are different documents, depending on how the link is set up. This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch-generated documents do not have the digest field populated, and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side.

Thanks so much in advance for your help. Here is my configuration:

SolrConfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml:

<field name="sig" type="string" stored="true" indexed="true" multiValued="true"/>

Thanks so much for your help.
Re: Using separate Analyzers for querying and indexing.
Excellent, exactly what I needed.

Thanks,
James
Re: Conditional Function Queries
Ok, I created the issues:

IF function: SOLR-2136
AND, OR, NOT: SOLR-2137

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
multiple local indexes
In our application, we need to be able to search across multiple local indexes. We need this not so much for performance reasons, but because of the particular needs of our project. The indexes, while sharing the same schema, can be very different in terms of size and distribution of documents. By that I mean that some indexes may have a lot more documents about some topic while others will have more documents about other topics. We want to be able to add documents to the individual indexes as well. I can provide more detail about our project if necessary.

Thus, the distributed search feature, with shards in different cores, seems to be an obvious solution except for the limitation of distributed idf. First, I want to make sure my understanding of the distributed idf limitation is correct: if your documents are spread across your shards evenly, then the distribution of terms across the individual shards can be assumed to be even enough not to matter. If, as in our case, the shards are not very uniform, then this limitation is magnified. Even though simplistic, do I have the basic idea?

We have hacked together something that allows us to read from multiple indexes, but it isn't really a long-term solution. It's just sort of shoe-horned in there. Here are some notes from the programmer who worked on this:

Two custom files: EgranaryIndexReaderFactory.java and EgranaryIndexReader.java

EgranaryIndexReader.java
No real work is done here. This class extends lucene.index.MultiReader and overrides the directory() and getVersion() methods inherited from IndexReader. These methods don't make sense for a MultiReader, as they only return a single value. However, Solr expects Readers to have these methods. directory() was overridden to return a call to directory() on the first reader in the subreader list. The same was done for getVersion(). This hack makes any use of these methods by Solr somewhat pointless.

EgranaryIndexReaderFactory.java
Overrides the newReader(Directory indexDir, boolean readOnly) method. The expected behavior of this method is to construct a Reader from the index at indexDir. However, this method ignores indexDir and reads a list of indexDirs from the solrconfig.xml file. These indices are used to create a list of lucene.index.IndexReader instances. This list is then used to create the EgranaryIndexReader.

So the second question is: does anybody have other ideas about how we might solve this problem? Is distributed search still our best bet?

Thanks for your thoughts!
Brent
RE: Solr Deduplication and Field Collapsing
You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field, so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips documents with an empty digest field. I'd think the latter would be the quickest route, but correct me if I'm wrong.

Cheers,

-Original message-
From: Nemani, Raj raj.nem...@turner.com
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org
Subject: Solr Deduplication and Field Collapsing

All, I have set up Nutch to submit the crawl results to a Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for those documents that are duplicates. While setting up the dedupe processor in the Solr config file, I used this 'digest' field in the following way (see below for config details). Since my index has documents other than the ones generated by Nutch, I cannot use overwriteDupes=true, because for non-Nutch-generated documents the digest field will not be populated, and I found that Solr deletes every one of those documents that do not have the digest field populated. Probably because they all have the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything?

In any case, given that scenario, I thought I would set overwriteDupes=false and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation I was adding the query string group=true&group.field=sig (or group=true&group.field=digest) to my overall query in the admin console, and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4.

All this is because Nutch thinks that http://mysite.mydomain.com/index.html and http://mysite/index.html are different documents depending on how the link is set up (the url *is* the unique id for the Nutch document; the difference is only in the alias, and for an internal site both are valid). This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch-generated documents do not have the digest field populated, and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side. Thanks so much in advance for your help.

Here is my configuration:

SolrConfig.xml:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
      <str name="fields">digest</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

Schema.xml:

  <field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

Thanks so much for your help
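A minimal sketch of the first suggestion -- an update processor that fills in an empty digest before the signature is computed. The class name, package, and the id field name are assumptions for illustration, written against the Solr 1.4-era processor API:

  import java.io.IOException;

  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.request.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class EnsureDigestProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object digest = doc.getFieldValue("digest");
          if (digest == null || "".equals(digest)) {
            // Non-Nutch document: borrow the unique id so the signature is
            // never computed from an empty digest.
            doc.removeField("digest"); // no-op when the field is absent
            doc.addField("digest", doc.getFieldValue("id"));
          }
          super.processAdd(cmd);
        }
      };
    }
  }

Wired into the dedupe chain ahead of the SignatureUpdateProcessorFactory, every document then reaches the signature step with a non-empty digest.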
RE: multiple local indexes
Honestly, I think just putting everything in the same index is your best bet. Are you sure the particular needs of your project can't be served by one combined index? You can certainly still query on just a portion of the index when needed using fq -- you can even create a request handler (or multiple request handlers) with invariants or appends to force all queries through that request handler to have a fixed fq.

Jonathan
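For concreteness, the invariant-fq handler Jonathan describes would look roughly like this in solrconfig.xml (the handler name and filter field are made up for illustration):

  <requestHandler name="/topicA" class="solr.SearchHandler">
    <lst name="invariants">
      <!-- every query through this handler is silently restricted to one slice of the index -->
      <str name="fq">collection:topicA</str>
    </lst>
  </requestHandler>

Clients query /topicA as usual and cannot override the fq, since invariants win over request parameters.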
Re: Solr Deduplication and Field Collapsing
I have the digest field already in the schema because the index is shared between Nutch docs and others. I do not know if the second approach is the quickest in my case. I can set the digest value to something unique for non-Nutch documents easily (I have an id field that I can use to populate the digest field during indexing of new non-Nutch documents; I have a custom tool that does the indexing of these docs). But I have more than 3 million documents in the index already, and I don't want to start over with new indexing if I don't have to. Is there a way I can update the digest field with the value from the corresponding id field using Solr?

Thanks
Raj
Re: Re: The search response time is too loong
Thx. I will let you know the latest status.

From: Lance Norskog goks...@gmail.com
To: solr-user@lucene.apache.org, newsam new...@zju.edu.cn
Subject: Re: Re: The search response time is too loong
Date: Tue, 28 Sep 2010 13:34:53 -0700

Copy the index. Delete half of the documents. Optimize. Copy the index. Delete the other half of the documents. Optimize.

2010/9/28 newsam:

I guess you are correct. We used the default SOLR cache configuration. I will change the cache configuration. BTW, I want to deploy several shards from the existing 8G index file, such as 4G per shard. Is there any tool to generate two shards from one 8G index file?

From: kenf_nc
Subject: Re: Re: The search response time is too loong
Date: Mon, 27 Sep 2010 05:37:25 -0700 (PDT)

"mem usage is over 400M" -- do you mean Tomcat mem size? If you don't give your caches enough room to grow, you will choke performance. You should adjust your Tomcat settings to let the cache grow to at least 1GB, or better, 2GB. You may also want to look into warming the cache (http://wiki.apache.org/solr/SolrCaching) to make the first call a little faster. For comparison, I also have about 8GB in my index but only 2.8 million documents. My search query times on a smaller box than you specify are 6533 milliseconds on an unwarmed (newly rebooted) instance.

--
Lance Norskog
goks...@gmail.com
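A hedged SolrJ sketch of the copy-and-delete split Lance outlines, assuming the 8G index has already been copied into two fresh Solr homes and that the uniqueKey is a string field named id; the midpoint value m and the shard URLs are hypothetical:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class SplitIndexInTwo {
    public static void main(String[] args) throws Exception {
      // Each URL points at a copy of the original 8G index.
      SolrServer shard1 = new CommonsHttpSolrServer("http://shard1:8983/solr");
      SolrServer shard2 = new CommonsHttpSolrServer("http://shard2:8983/solr");

      // Partition on the uniqueKey: shard1 keeps ids below m, shard2 keeps the rest.
      // The complement is written as "*:* -range" so the boundary id survives on exactly one shard.
      shard1.deleteByQuery("id:[m TO *]");
      shard2.deleteByQuery("*:* -id:[m TO *]");

      shard1.commit();
      shard2.commit();

      // Lance's final step: optimize each half to reclaim the deleted space.
      shard1.optimize();
      shard2.optimize();
    }
  }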
Re: Best way to check Solr index for completeness
Have you looked at Solr's TermsComponent? Assuming you have a unique key, I think you could use TermsComponent to walk that field for comparison against your database, rather than fetching all the documents.

HTH
Erick

On Tue, Sep 28, 2010 at 5:11 PM, dshvadskiy dshvads...@gmail.com wrote:

That will certainly work for the most recent updates, but I need to compare the entire index.
Dmitriy

Luke Crouch wrote:

Is there a 1:1 ratio of db records to Solr documents? If so, couldn't you simply select the most recently updated record from the db and check that the corresponding Solr doc has the same timestamp?
-L

On Tue, Sep 28, 2010 at 3:48 PM, Dmitriy Shvadskiy dshvads...@gmail.com wrote:

Hello, what would be the best way to check a Solr index against the original system (database) to make sure the index is up to date? I can use Solr fields like id and timestamp to check against the appropriate fields in the database. Our index currently contains over 2 million documents across several cores. Pulling all documents from the Solr index via search (1000 docs at a time) is very slow. Is there a better way to do it?

Thanks,
Dmitriy
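For the record, a TermsComponent walk over the uniqueKey would look something like this, one page per request (assuming a /terms handler with the TermsComponent is registered and the key field is named id; parameter details may vary by version):

  http://localhost:8983/solr/terms?terms=true&terms.fl=id&terms.sort=index&terms.limit=1000&terms.lower=&terms.lower.incl=false

Feed the last term of each page back in as terms.lower for the next request, and diff the resulting stream of ids against a similarly sorted id list from the database.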
Re: How to tell whether a plugin is loaded?
: then in method createParser() add the following:
:
: req.getCore().getInfoRegistry().put(getName(), this);

That doesn't seem like a good idea -- createParser will be called every time a string needs to be parsed, so you're overwriting the same entry in the infoRegistry over and over again. I would just put that logic in your init() method (make sure to put the QParserPlugin in the registry, not the individual QParser instances).

: I wonder though whether it'd be useful if Solr QParserPlugin did
: implement SolrInfoMBean by default already...

I agree ... I think that was an oversight when QParser was added. There's an open issue for it, but no one has had a chance to get around to it yet...

https://issues.apache.org/jira/browse/SOLR-1428

-Hoss

--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!
Why is query performance so different for different queries?
Hi guys, I have posted a thread, "The search response time is too loong". The SOLR searcher instance is deployed with Tomcat 5.5.21. The index file is 8.2G and the doc count is 6,110,745. The DELL server has an Intel(R) Xeon(TM) CPU (4 cores) at 3.00GHZ and 6G RAM.

In the SOLR back-end, query=key:* costs almost 60s, while query=*:* only needs 500ms. Another case is query=product_name_title:*, which costs 7s. I am confused about the query performance. Do you have any suggestions?

BTW, the cache settings are as follows:

filterCache: 256, 256, 0
queryResultCache: 1024, 512, 128
documentCache: 16384, 4096, n/a

Thanks.
Solr with example Jetty and score problem
Hi there, I have a problem. When I issue a query to a single instance, Solr responds with XML like the following; as you can see, the score is a normal float element (<float name="score">...</float>):

===
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">23</int>
    <lst name="params">
      <str name="fl">_l_title,score</str>
      <str name="start">0</str>
      <str name="q">_l_unique_key:12</str>
      <str name="hl.fl">*</str>
      <str name="hl">true</str>
      <str name="rows">999</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0" maxScore="1.9808292">
    <doc>
      <float name="score">1.9808292</float>
      <str name="_l_title">GTest</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="12">
      <arr name="_l_unique_key">
        <str><em>12</em></str>
      </arr>
    </lst>
  </lst>
</response>
===

But when I issue the query with shards (two instances), the response XML looks like the following; as you can see, the score has been transferred into an arr element of the doc:

===
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">64</int>
    <lst name="params">
      <str name="shards">localhost:8983/solr/core0,172.16.6.35:8983/solr</str>
      <str name="fl">_l_title,score</str>
      <str name="start">0</str>
      <str name="q">_l_unique_key:12</str>
      <str name="hl.fl">*</str>
      <str name="hl">true</str>
      <str name="rows">999</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0" maxScore="1.9808292">
    <doc>
      <str name="_l_title">Gtest</str>
      <arr name="score">
        <float name="score">1.9808292</float>
      </arr>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="12">
      <arr name="_l_unique_key">
        <str><em>12</em></str>
      </arr>
    </lst>
  </lst>
</response>
===

My schema.xml is like the following:

  <field name="_l_unique_key" type="string" indexed="true" stored="true" required="true" omitNorms="true"/>
  <field name="_l_read_permission" type="string" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
  <field name="_l_title" type="text" indexed="true" stored="true" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="_l_summary" type="text" indexed="true" stored="true" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="_l_body" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" omitNorms="false"/>
  <dynamicField name="*" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" omitNorms="false"/>
  </fields>

  <uniqueKey>_l_unique_key</uniqueKey>
  <defaultSearchField>_l_body</defaultSearchField>

I don't really know what happened. Is it a problem with my schema, or is this the behavior of Solr? Please help with this.
Re: multiple local indexes
Thanks for your comments, Jonathan. Here is some information that gives a brief overview of the eGranary Platform and outlines why we need a way to bring multiple indexes into one searchable collection:

http://www.widernet.org/egranary/info/multipleIndexes

Thanks,
Brent