How much data can Solr handle?
We're looking to build a search solution that can hold as many as 10 million different items, and I was wondering whether Solr can handle that amount of data. Has anybody done any testing or published any results for a Solr installation working on a data set this large?

//Daniel

--
Daniel Löfquist
Software Engineer
CDON.COM
Bergsgatan 20, Box 385, SE 201 23 Malmö, Sweden
Office: +46 40 601 61 00
Direct: +46 40 601 61 16
Fax: +46 40 601 61 20
E-mail: daniel.lofqu...@it.cdon.com
CDON.COM <http://www.cdon.com/>

Confidentiality
Information contained in this e-mail is intended for the use of the addressee only, and is confidential. Any dissemination, distribution, copying or use of this communication without prior permission of the addressee is strictly prohibited. If you are not the intended addressee you must delete this e-mail and its attachments.
Group by field in Solr
Hello,

I'm trying to accomplish something akin to "GROUP BY" in SQL, but in Solr. I have an index full of songs (one song per document in Solr) by various artists, and I would like to construct a search that gives me all of the artists back, one row per artist. The current search returns one row per artist and song.

So this is what I get right now if I search for "war" in the artist field:

30 Years War - Ideal Means
30 Years War - Dirty Castle
30 Years War - Misinformed
All Out War - Soaked In Torment
All Out War - Claim Your Innocence
Audio War - Negativity
Audio War - One Drug
Audio War - Super Freak

But this is what I'd really like:

30 Years War - whatever song
All Out War - whatever song
Audio War - whatever song

I tried using facets but couldn't get it to work properly. Does anybody have a clue how to do something like this?

//Daniel
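The desired result can at least be approximated on the client side. The following is a minimal Python sketch, assuming the search results have already been fetched as (artist, song) pairs; the `first_per_artist` helper and the data layout are hypothetical and are not part of Solr.

```python
def first_per_artist(rows):
    """Keep the first (artist, song) pair seen for each artist, preserving order."""
    seen = {}
    for artist, song in rows:
        # setdefault stores the song only the first time an artist appears
        seen.setdefault(artist, song)
    return list(seen.items())


# Sample rows copied from the search result quoted in the question
rows = [
    ("30 Years War", "Ideal Means"),
    ("30 Years War", "Dirty Castle"),
    ("30 Years War", "Misinformed"),
    ("All Out War", "Soaked In Torment"),
    ("All Out War", "Claim Your Innocence"),
    ("Audio War", "Negativity"),
    ("Audio War", "One Drug"),
    ("Audio War", "Super Freak"),
]

for artist, song in first_per_artist(rows):
    print(f"{artist} - {song}")
```

This only collapses the rows that were actually returned, so an artist whose songs all fall outside the requested page of results would be missing from the output.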
No search hits for items starting with one-letter words
Hello all,

I have an odd problem. I have a Solr index containing songs by various artists. When I perform a search for something that starts with a one-letter word I receive no hits. If I remove the one-letter word I get hits, though.

So for example, if I search for "a hard days night" or "i want you back" I get 0 hits, but if I search for "hard days night" or "want you back" there are hits.

This behaviour doesn't affect items starting with a number. So if a song title were to start with a number, that's no problem; I will get hits for that.

The fieldtype I'm using for the text field containing the song title is defined in my schema.xml like this:

Can anyone tell me what may be the source of my problem and how to fix it? I'm on a deadline so quick answers are greatly appreciated ;-)

Thanks for listening,
//Daniel
Solr interprets UTF-8 as ISO-8859-1
Hello,

We're building a web application that uses Solr for searching and I've come upon a problem that I can't seem to get my head around.

We have a servlet that accepts input via XML-RPC and, based on that input, constructs the correct URL to perform a search with the Solr servlet.

I know that the call to Solr (the URL) from our servlet looks like this (which is what it should look like):

http://myserver:8080/solrproducts/select/?q=all_SV:ljusblå+status:online&fl=id%2Cartno%2Ctitle_SV%2CtitleSort_SV%2Cdescription_SV%2C&sort=titleSort_SV+asc,id+asc&start=0&q.op=AND&rows=25

But Solr reports the input fields (the GET variables in the URL) as:

INFO: /select/ fl=id,artno,title_SV,titleSort_SV,description_SV,&sort=titleSort_SV+asc,id+asc&start=0&q=all_SV:ljusblÃ¥+status:online&q.op=AND&rows=25

which is all fine except where it says "ljusblÃ¥". Apparently Solr is interpreting the UTF-8 string "ljusblå" as ISO-8859-1 and thus produces garbage that makes the search return 0 hits when it should in reality return 3. All other searches that don't use special characters work 100% fine.

I'm new to Solr so I'm not sure what I'm doing wrong here. Can anybody help me out and point me in the direction of a solution?

Sincerely,
Daniel Löfquist
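The garbled value in the log is exactly what appears when the UTF-8 bytes of "ljusblå" are decoded as ISO-8859-1. A minimal Python sketch that reproduces the symptom follows; it is an illustration only and does not involve Solr or the servlet container.

```python
from urllib.parse import quote

term = "ljusblå"

# "å" is two bytes in UTF-8 (0xC3 0xA5); decoded as ISO-8859-1 they
# become the two characters "Ã" and "¥".
utf8_bytes = term.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Percent-encoding the UTF-8 bytes gives the form the term takes inside a URL.
encoded = quote(term)
print(encoded)

# The damage is reversible as long as both encodings are known.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

Whatever decodes the query string on the receiving side has to use the same character encoding the sender used when building the URL, otherwise the term that reaches the search no longer matches what was indexed.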
Re: Solved! Solr interprets UTF-8 as ISO-8859-1
That did the trick. I actually figured it out on my own 10 minutes after I posted to the mailing list. Typical ;-)

Thanks for the help anyway, everybody!

//Daniel

Uwe Klosa wrote:

You should set URIEncoding="UTF-8" in your application server. For Tomcat you can do that in the server.xml. For Glassfish you have to create a sun-web.xml containing the corresponding parameters. Your application server should provide a similar mechanism.

Uwe

On Mon, Mar 31, 2008 at 4:32 PM, Daniel Löfquist <[EMAIL PROTECTED]> wrote:

[...]

--
Daniel Löfquist
Application Manager / Software Engineer
CDON.COM
Bergsgatan 20, Box 385, SE 201 23 Malmö, Sweden
Office: +46 40 601 61 00
Direct: +46 40 601 61 16
Mobile: +46 702 92 21 75
Fax: +46 40 601 61 20
E-mail: [EMAIL PROTECTED]
CDON.COM <http://www.cdon.com/>
Searching "inside of words"
Hi,

I'm still pretty new to Solr. We're using it for searching on our site right now, though. The configuration is however pretty much based on the example files that come with Solr, and there's one type of search that I can't get to work.

Each item has fields called "title" and "description", both of which are of type "text".

The type "text" is defined like this in our schema.xml:

words="stopwords.txt"/>
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
ignoreCase="true" expand="true"/>
words="stopwords.txt"/>
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>

My problem is that if I have an item with "title"="Termobyxa", a search for "Termo" gives me a hit, but if I search for "ermo" or "byxa" I get no hit. How do I make it so that this kind of search "inside a word" returns a hit?

Sincerely,
Daniel Löfquist
Re: Searching "inside of words"
Sorry for taking forever to reply, but anyway...

We're using Solr 1.2.0 and for various reasons can't use the nightly version. The 1.2.0 version doesn't have NGramFilterFactory and EdgeNGramFilterFactory, so the only ones I can use are EdgeNGramTokenizerFactory and NGramTokenizerFactory.

I've done some playing around with them, but the best result I've gotten so far is a field type that enables searching for specific letters. For example, I can search for an item that contains the letters a and x, but it returns a hit no matter where these letters are in the text; they don't have to be next to each other, and that's not the result I was going for. If the field contains "monitor" I want a hit on a search for "onit" but not on "rint", for example.

I've never attempted to construct a new field type of my own before, and I'm finding the available documentation somewhat incomplete and not very helpful, so I really need some pointers from people who know better than me here. If anyone could help me out, maybe even with some example code, I'd be eternally grateful.

//Daniel

Otis Gospodnetic wrote:

Hi Daniel,

Well, searching "inside of words" requires special treatment, because normally searches work on words/terms/tokens. Make use of the following:

$ ff \*NGram\*java
./src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java
./src/java/org/apache/solr/analysis/NGramTokenizerFactory.java
./src/java/org/apache/solr/analysis/NGramFilterFactory.java
./src/java/org/apache/solr/analysis/EdgeNGramFilterFactory.java

Use these to create a new field type that makes Solr tokenize and index your terms as, say, uni-grams. Instead of (or in addition to) indexing "Termobyxa", index "T e r m o b y x a". Do the same with the query-time analyzer, and you'll be able to search within words.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Daniel Löfquist <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, April 17, 2008 5:46:15 AM
Subject: Searching "inside of words"

[...]
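The false hit on "rint" follows from treating single letters as independent terms. Here is a simplified Python model of that situation, assuming the letters are matched as an unordered set with AND semantics; the helper names are hypothetical, and a real Solr setup may behave differently depending on the analyzer and query parser.

```python
def unigrams(text):
    """Return the set of single-character tokens in text (positions are discarded)."""
    return set(text.lower())


def matches_all_letters(query, field_value):
    """AND-style match: every letter of the query occurs somewhere in the field."""
    return unigrams(query) <= unigrams(field_value)


# The wanted hit: "onit" is a substring of "monitor"
print(matches_all_letters("onit", "monitor"))

# The unwanted hit: r, i, n and t all occur in "monitor", just not adjacently
print(matches_all_letters("rint", "monitor"))
```

Both calls report a match, which is the behaviour described above: once adjacency is lost, any query whose letters all occur in the field value is a hit.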
Re: Searching "inside of words"
Thank you for your reply. I've been trying some things out this morning but I'm still not getting it to work properly. I have a feeling that I'm somewhat on the right track, though.

The type in my schema.xml looks like this:

If I'm understanding everything correctly, this should create tokens with a size of 2 to 18 letters at indexing time, right?

However, I can't search properly now. I have to slice my search string up into 2-letter chunks. So if I'm searching for "monitor" I have to send "mo+ni+to+r" to Solr. Like this:

http://localhost:8080/solrtest/select/?q=mo+ni+to+r&q.op=AND

when I want it to be like this:

http://localhost:8080/solrtest/select/?q=monitor&q.op=AND

I'm sure I'm doing something completely wrong. I just need someone wiser in the ways of Lucene and Solr to point directly at what it is that's wrong ;-)

//Daniel

Chris Hostetter wrote:

: so the only ones I can utilize are EdgeNGramTokenizerFactory and
: NGramTokenizerFactory.
:
: I've done some playing around with them but the best result I've gotten so far
: is a field-type that enables searching for specific letters, for example I can
: search for an item that contains the letters a and x, but it returns a hit no
: matter where these letters are in the text, they don't have to be next to each
: other, and that's not the result I was going for. If the field contains
: "monitor" I want a hit on a search for "onit" but not on "rint" for example.

NGramTokenizerFactory should work fine for this ... the key is to use it at indexing time with the appropriate min and max gram sizes to meet your needs -- at query time, don't use it at all (use keyword or whitespace tokenizer)

so the word "monitor" will be indexed as these tokens (but not necessarily in this order)...

m o n i t o r
mo on ni it to or
mon oni nit ...
onit ...

and at search time when the user gives you "onit" that term will exist.

: I've never attempted to construct a new field-type of my own before and I'm
: finding the available documentation somewhat incomplete and not very helpful

FWIW: creating a new FieldType is almost never what you need if you are dealing with text .. creating new FieldTypes is something that typically only needs done in cases where you want specialized encoding or sorting.

-Hoss
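The index-time n-gram approach described in the reply can be illustrated outside Solr. The following is a minimal Python sketch, not Solr's own tokenizer: `char_ngrams` is a hypothetical helper, and the gram sizes 1 to 7 are an assumption chosen to cover the whole word "monitor".

```python
def char_ngrams(word, min_size, max_size):
    """Return every contiguous substring of word with length min_size..max_size."""
    grams = set()
    for size in range(min_size, max_size + 1):
        # No start positions exist when size exceeds the word length
        for start in range(len(word) - size + 1):
            grams.add(word[start:start + size])
    return grams


# Index time: store every n-gram of the field value as a term
index_terms = char_ngrams("monitor", 1, 7)

# Query time: the user's input is looked up as one single term, not split up
print("onit" in index_terms)
print("rint" in index_terms)
print(sorted(g for g in index_terms if len(g) == 2))
```

Because only contiguous substrings are indexed, "onit" is found while "rint" is not, even though all of its letters occur in "monitor".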
Re: Searching "inside of words"
Thanks a million! That totally did the trick. It is now working at least 95% like I want it to. Gotta tweak it a little more, but it seems like the hard part is over.

Thanks once again to everybody who helped out.

//Daniel

Chris Hostetter wrote:

: You are doing the right thing. If you are creating n-grams at index
: time, you have to match that at query time. If the query is "monitor",
: you need to pass that through n-gram tokenizer, too. n-grams of length
: 18 look a little weird

you don't *have* to use ngrams at query time ... his goal is "partial" word matching, so he wants to create various sized ngrams so that input like "onit" matches "monitor" but does not match "on it"

Daniel: the options for NGramTokenizerFactory are minGramSize and maxGramSize ... not minGram and maxGram ... you are getting the defaults (which are 1 and 2 i think)

it confused me too until i tried your schema changes, and then looked at the analysis.jsp link and saw only 1 and 2 gram tokens being created .. then i checked the class.

-Hoss
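The effect of silently falling back to gram sizes 1 and 2 can be shown with a small Python sketch. It models n-gram indexing as plain substring generation; `ngram_terms` is a hypothetical helper, and the default sizes are taken from the reply above rather than verified against the Solr source.

```python
def ngram_terms(word, min_size, max_size):
    """All contiguous substrings of word whose length is in min_size..max_size."""
    return {
        word[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(word) - n + 1)
    }


fallback = ngram_terms("monitor", 1, 2)   # sizes in effect when the options are misnamed
intended = ngram_terms("monitor", 2, 18)  # sizes the schema was meant to configure

# Each line prints: query, found with fallback sizes, found with intended sizes
for query in ["mo", "onit", "monitor"]:
    print(query, query in fallback, query in intended)
```

With the fallback sizes only one- and two-letter terms exist in the index, which is consistent with the earlier observation that "monitor" had to be sent as "mo+ni+to+r" to get a hit.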