Re: Multi Language Suggester Solr Issue
I noticed that your suggester analyzers include <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>, which seems like a bad idea -- this will strip all those Arabic, Russian and Japanese characters entirely, leaving you with probably only whitespace in your tokens. Try just removing that? -Mike On 12/24/14 6:09 PM, alaa.abuzaghleh wrote: I am trying to create a suggester handler using Solr 4.8. Everything works fine, but when I try to get suggestions in a different language, Arabic or Japanese for example, I get results in mixed languages: even when I search only in Japanese, I get Arabic with it too. The following is my schema.xml:

  <?xml version="1.0" encoding="UTF-8"?>
  <schema name="people_schema" version="1.5">
    <fields>
      <field name="_version_" type="long" indexed="true" stored="true"/>
      <field name="id" type="string" indexed="true" stored="true" required="true"/>
      <field name="first_name" type="txt_general" indexed="true" stored="true" multiValued="false"/>
      <field name="last_name" type="txt_general" indexed="true" stored="true" multiValued="false"/>
      <field name="about" type="text_general_edge_ngram" indexed="true" stored="true" multiValued="false"/>
      <field name="year_birth" type="tint" indexed="true" stored="true" multiValued="false"/>
      <field name="month_birth" type="tint" indexed="true" stored="true" multiValued="false"/>
      <field name="day_birth" type="tint" indexed="true" stored="true" multiValued="false"/>
      <field name="country" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
      <field name="country_tree" type="placetree" indexed="true" stored="false" multiValued="false"/>
      <field name="state" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
      <field name="state_tree" type="placetree" indexed="true" stored="false" multiValued="false"/>
      <field name="city" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
      <field name="city_tree" type="placetree" indexed="true" stored="false" multiValued="false"/>
      <field name="job" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
      <field name="job_tree" type="txt_general" indexed="true" stored="true" multiValued="false"/>
      <field name="company" type="string" indexed="true" stored="true" required="false" multiValued="false"/>
      <field name="company_tree" type="companytree" indexed="true" stored="false" multiValued="false"/>
      <field name="full_name" type="txt_general" indexed="true" stored="true" multiValued="false"/>
      <field name="full_name_suggest" type="text_suggest" indexed="true" stored="true" multiValued="false"/>
      <field name="full_name_edge" type="text_suggest_edge" indexed="true" stored="true" multiValued="false"/>
      <field name="full_name_ngram" type="text_suggest_ngram" indexed="true" stored="true" multiValued="false"/>
      <field name="full_name_sort" type="alphaNumericSort" indexed="true" stored="true" multiValued="false"/>
      <field name="job_suggest" type="text_suggest" indexed="true" stored="true" multiValued="false"/>
      <field name="job_edge" type="text_suggest_edge" indexed="true" stored="true" multiValued="false"/>
      <field name="job_ngram" type="text_suggest_ngram" indexed="true" stored="true" multiValued="false"/>
      <field name="job_sort" type="alphaNumericSort" indexed="true" stored="true" multiValued="false"/>
      <copyField source="full_name" dest="full_name_suggest"/>
      <copyField source="full_name" dest="full_name_edge"/>
      <copyField source="full_name" dest="full_name_ngram"/>
      <copyField source="full_name" dest="full_name_sort"/>
      <copyField source="job_tree" dest="job_suggest"/>
      <copyField source="job_tree" dest="job_edge"/>
      <copyField source="job_tree" dest="job_ngram"/>
      <copyField source="job_tree" dest="job_sort"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <types>
      <fieldType name="string" class="solr.StrField" sortMissingLast="true
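As a concrete sketch of Mike's suggestion: the analyzer chain below is hypothetical (the original post only shows the filter line), but it illustrates the fix, which is simply dropping the pattern-replace filter. Java's \w only matches ASCII word characters, so the negated class ([^\w\d\*æøåÆØÅ ]) matches, and therefore deletes, every Arabic, Cyrillic and Japanese character.

  <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- PatternReplaceFilterFactory removed: its pattern deleted all
           non-Latin characters, leaving empty or whitespace-only tokens -->
    </analyzer>
  </fieldType>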
Re: distrib=false
Erick, I have attached a screenshot of the topology; as you can see I have three nodes, and no two replicas of the same shard reside on the same node; this was done so as not to affect availability. The query I use is a general get-all query of the form *:*, for testing. The behavior I notice is that even when a particular replica of a shard is queried using distrib=false, the request goes to the other replica of the same shard. Thanks. On Sat, Dec 27, 2014 at 2:10 PM, Erick Erickson erickerick...@gmail.com wrote: How are you sending the request? AFAIK, setting distrib=false should keep the query from being sent to any other node, although I'm not quite sure what happens when you host multiple replicas of the _same_ shard on the same node. So we need: 1> your topology: how many nodes, and which replicas on each? 2> the actual query you send. Best, Erick On Sat, Dec 27, 2014 at 8:14 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a question regarding distrib=false on a Solr query. It seems that distribution is restricted only across shards when the parameter is set to false; meaning that if I query a particular node within a shard that has a replication factor of more than one, the request could still go to another node within the same shard, a replica of the node I made the initial request to. Is my understanding correct? If yes, then how do we make sure that the request goes only to the node I intend to make the request to? Thanks.
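For what it's worth, a distrib=false request is usually issued against a concrete core rather than the collection; a minimal sketch (host and core name here are hypothetical):

  curl "http://node1:8983/solr/collection1_shard1_replica1/select?q=*:*&distrib=false&wt=json"

In stock Solr this should execute only on the core that receives it; if the request is sent to the collection endpoint instead, SolrCloud may still pick any replica to serve it, which could explain the forwarding described in this thread.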
How to implement multi-set in a Solr schema.
Hi All, I have a use case where I need to group documents that have the same value in a field called bookName. If there are multiple documents with the same bookName value and the user searches with a query on bookName, I need to be able to group all the documents with the same bookName together, so that I can display them as a group in the UI. What kind of support does Solr provide for such a scenario, and how should I look at changing my schema.xml, which has bookName as a single-valued text field? Thanks.
Re: Solr performance issues
On 12/26/2014 7:17 AM, Mahmoud Almokadem wrote: We've installed a cluster of one collection of 350M documents on 3 r3.2xlarge (60GB RAM) Amazon servers. The size of the index on each shard is about 1.1TB, and the maximum storage on Amazon is 1TB, so we added 2 SSD EBS General Purpose volumes (1x1TB + 1x500GB) on each instance, then created a logical volume of 1.5TB using LVM to fit our index. The response time is about 1 to 3 seconds for simple queries (1 token). Has the LVM become a bottleneck for our index? SSD is very fast, but its speed is very slow when compared to RAM. The problem here is that Solr must read data off the disk in order to do a query, and even at SSD speeds, that is slow. LVM is not the problem here, though it's possible that it may be a contributing factor. You need more RAM. For Solr to be fast, a large percentage (ideally 100%, but smaller fractions can often be enough) of the index must be loaded into unused RAM by the operating system. Your information seems to indicate that the index is about 3 terabytes. If that's the index size, I would guess that you would need somewhere between 1 and 2 terabytes of total RAM for speed to be acceptable. Because RAM is *very* expensive on Amazon and is not available in sizes like 256GB-1TB, that typically means a lot of their virtual machines, with a lot of shards in SolrCloud. You may find that, for very large Solr indexes, real hardware is less expensive than cloud hardware in the long term. Thanks, Shawn
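Back of the envelope from the numbers above: 3 shards x 1.1TB is roughly 3.3TB of index, versus 3 x 60GB = 180GB of total RAM, so at best about 5% of the index can ever be cached. On each node, a standard Linux check (generic tooling, not specific to this thread) shows how much RAM the OS is actually using as page cache:

  free -g    # the "cached" column (or "buff/cache" on newer systems) holds the index data
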
How does text-rev work?
I am looking at the collection1/techproducts schema and I can't figure out how the reversed wildcard example is supposed to work. We define the text_general_rev type and the text_rev field, but we don't seem to be populating it at any point. And running the example does not seem to show any tokens in the field, even when the non-reversed text field does have some. Apparently there is some magic in the QueryParser to do something about this at query time, but I see no explanation of what is supposed to happen at index/schema time. Does anybody have the skinny on this one? Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/
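For context, the relevant fieldType in the 4.x example schema looks roughly like the sketch below (quoted from memory, so treat the attribute values as approximate). The index-time filter stores reversed copies of tokens alongside the originals; at query time the parser notices that a field's type contains ReversedWildcardFilterFactory and rewrites a leading-wildcard query such as *foo into a prefix query on the reversed terms.

  <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
              maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

If no copyField targets the text_rev field, it stays empty, which would match what Alex observes.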
Re: distrib=false
On 12/28/2014 8:48 AM, S.L wrote: I have attached a screenshot of the topology; as you can see I have three nodes, and no two replicas of the same shard reside on the same node; this was done so as not to affect availability. The query I use is a general get-all query of the form *:*, for testing. The behavior I notice is that even when a particular replica of a shard is queried using distrib=false, the request goes to the other replica of the same shard. Attachments almost never make it through the mailing list processing, and the screenshot you mentioned did not make it. You'll need to host the image somewhere and provide a URL. The Dropbox service is a good way to do this, but it's not the only way. Just make sure you don't remove the image quickly. The message will live on for years in the archive ... it would be nice to have the image live on for years as well, though I know that is often not realistic. I do not know exactly how SolrCloud handles such requests, but it would not surprise me to learn that it forwards the request to another replica of the same shard on another server. An issue has been filed to change the general load-balancing behavior of SolrCloud, and there has been a fair amount of discussion on it: https://issues.apache.org/jira/browse/SOLR-6832 Thanks, Shawn
Re: How to implement multi-set in a Solr schema.
Hi, You can use grouping in Solr. You can do this either via the query or via solrconfig.xml.

A) via query

  http://localhost:8983?your_query_params&group=true&group.field=bookName

You can limit the size of each group (how many documents you want to show); if you want to show 5 documents per group for this bookName field, specify the parameter group.limit=5.

B) via solrconfig

  <str name="group">true</str>
  <str name="group.field">bookName</str>
  <str name="group.ngroups">true</str>
  <str name="group.truncate">true</str>

With Regards Aman Tandon On Sun, Dec 28, 2014 at 10:29 PM, S.L simpleliving...@gmail.com wrote: Hi All, I have a use case where I need to group documents that have the same value in a field called bookName. If there are multiple documents with the same bookName value and the user searches with a query on bookName, I need to be able to group all the documents with the same bookName together, so that I can display them as a group in the UI. What kind of support does Solr provide for such a scenario, and how should I look at changing my schema.xml, which has bookName as a single-valued text field? Thanks.
Re: Multi Language Suggester Solr Issue
Thanks, it works for me.
Re: How to implement multi-set in a Solr schema.
Thanks Aman. The thing is, the bookName field values are not exactly identical, only nearly identical, so at indexing time I need to figure out which other book names a given value is similar to, using NLP techniques, and then put it in the appropriate bag, so that at retrieval time I retrieve all the elements from that bag if any one of them matches the search query. Thanks. On Dec 28, 2014 1:54 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, You can use grouping in Solr. You can do this either via the query or via solrconfig.xml.

A) via query

  http://localhost:8983?your_query_params&group=true&group.field=bookName

You can limit the size of each group (how many documents you want to show); if you want to show 5 documents per group for this bookName field, specify the parameter group.limit=5.

B) via solrconfig

  <str name="group">true</str>
  <str name="group.field">bookName</str>
  <str name="group.ngroups">true</str>
  <str name="group.truncate">true</str>

With Regards Aman Tandon On Sun, Dec 28, 2014 at 10:29 PM, S.L simpleliving...@gmail.com wrote: Hi All, I have a use case where I need to group documents that have the same value in a field called bookName. If there are multiple documents with the same bookName value and the user searches with a query on bookName, I need to be able to group all the documents with the same bookName together, so that I can display them as a group in the UI. What kind of support does Solr provide for such a scenario, and how should I look at changing my schema.xml, which has bookName as a single-valued text field? Thanks.
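One way to wire that in, sketched under the assumption that the NLP step can be reduced to computing one canonical key per title: run an update processor at index time that writes the key into a (hypothetical) bookGroupKey string field, then group on that field instead of on bookName. Solr's StatelessScriptUpdateProcessorFactory supports this kind of hook; the lowercase-and-strip normalization below is only a stand-in for a real similarity algorithm, and all names are made up for illustration.

  <updateRequestProcessorChain name="bookgroup">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">bookgroup.js</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

with bookgroup.js along the lines of (plus empty stubs for the other processor functions if your version requires them):

  function processAdd(cmd) {
    var doc = cmd.solrDoc;
    var name = doc.getFieldValue("bookName");
    if (name != null) {
      // stand-in for the real NLP canonicalization step
      doc.setField("bookGroupKey", String(name).toLowerCase().replace(/[^a-z0-9 ]/g, ""));
    }
  }

Queries then use group=true&group.field=bookGroupKey.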
RE: Solr performance issues
Mahmoud Almokadem [prog.mahm...@gmail.com] wrote: We've installed a cluster of one collection of 350M documents on 3 r3.2xlarge (60GB RAM) Amazon servers. The size of the index on each shard is about 1.1TB, and the maximum storage on Amazon is 1TB, so we added 2 SSD EBS General Purpose volumes (1x1TB + 1x500GB) on each instance, then created a logical volume of 1.5TB using LVM to fit our index. Your search speed will be limited by the slowest storage in your group, which would be your 500GB EBS. The General Purpose SSD option means (as far as I can read at http://aws.amazon.com/ebs/details/#piops) a baseline of 3 IOPS/GB, i.e. 3 x 500 = 1500 IOPS, with bursts of 3000 IOPS. Unfortunately they do not say anything about latency. For comparison, I checked the system logs from a local test with our 21TB / 7 billion documents index. It used ~27,000 IOPS during the test, with mean search time a bit below 1 second. That was with ~100GB RAM for disk cache, which is about 0.5% of index size. The test was with simple term queries (1-3 terms) and some faceting. Back of the envelope: 27,000 IOPS for 21TB is ~1300 IOPS/TB. Your indexes are 1.1TB, so 1.1*1300 ~= 1400 IOPS. All else being equal (which is never the case), getting 1-3 second response times for a 1.1TB index does not seem unrealistic when one link in the storage chain is capped at a few thousand IOPS, you are using networked storage, and you have little RAM for caching. If possible, you could try temporarily boosting performance of the EBS to see if raw IO is the bottleneck. The response time is about 1 to 3 seconds for simple queries (1 token). Is the index updated while you are searching? Do you do any faceting or other heavy processing as part of a search? How many hits does a search typically have, and how many documents are returned? How many concurrent searches do you need to support? How fast should the response time be? - Toke Eskildsen
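To verify the IOPS theory directly, standard Linux tooling (not something from this thread) can show per-device request rates while typical queries are replayed:

  iostat -x 5    # watch r/s (reads per second) on the EBS devices

If r/s sits pinned near the ~1500 IOPS baseline while queries are slow, storage is the cap; if it stays well below, look at RAM or CPU instead.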
Re: Solr server becomes non-responsive.
Thanks Jack for your suggestions. Regards, Modassar On Fri, Dec 26, 2014 at 6:04 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Either you have too little RAM on each node or too much data on each node. You may need to shard the data much more heavily, so that the total work for a single query is distributed in parallel to more nodes, each node having a much smaller amount of data to work on. First, always make sure that the entire Lucene index for each node fits entirely in the system memory available for file system caching; otherwise the queries will be I/O bound. Check your current queries to see if that is the case: are the nodes compute bound or I/O bound? If I/O bound, add more system memory until the queries are no longer I/O bound. If compute bound, shard more heavily until the query latency becomes acceptable. -- Jack Krupansky On Fri, Dec 26, 2014 at 1:02 AM, Modassar Ather modather1...@gmail.com wrote: Thanks for your suggestions Erick. This may be one of those situations where you really have to push back at the users and understand why they insist on these kinds of queries. They must be very patient since it won't be very performant. That said, I've seen this pattern; there are certainly valid conditions under which response times can be many seconds if there are few users and they are doing very complex/expert-level things. We have tried educating the users, but it did not work because they are used to the old way. They feel that wildcards give more control over the results, and they may not fully understand stemming. Regards, Modassar On Thu, Dec 25, 2014 at 3:17 AM, Erick Erickson erickerick...@gmail.com wrote: There's no magic bullet here that I know of. If your requirements are to support these huge, many-wildcard queries then you only have a few choices:

1> Redo the index. I was surprised at how little adding ngrams bloated the index as far as memory required is concerned. The key here is that there really aren't very many unique terms: if you use bigrams, then there are only maybe 36^2 distinct combinations (assuming English and including numbers). See the fieldType sketch after this message.

2> Increase the number of shards, putting many fewer docs on each shard.

3> Give each shard a lot more memory. This isn't actually one of my preferred solutions, since GC issues may raise their ugly heads here.

4> <insert creative solution here>

This may be one of those situations where you really have to push back at the users and understand why they insist on these kinds of queries. They must be very patient since it won't be very performant. That said, I've seen this pattern; there are certainly valid conditions under which response times can be many seconds if there are few users and they are doing very complex/expert-level things. Now, all that said, wildcards are often examples of poor habits, or habits learned in DB systems where the only hammer was %whatever%. I've seen situations where users didn't understand that Solr broke the input stream up into words. And stemmed. And that WordDelimiterFilterFactory did all the magic for finding, say, D.C. and DC. So it's worth looking at the actual queries that are sent, perhaps talking to users and understanding what they _want_ out of the system, then perhaps educating them as to better ways to get what they want. Literally I've seen people insist on entering queries that wildcarded _everything_, both pre and post wildcards, because they didn't realize that Solr tokenizes... Once you hit an OOM, all bets are off, as Shawn outlined.
Best, Erick On Wed, Dec 24, 2014 at 1:57 AM, Modassar Ather modather1...@gmail.com wrote: Thanks for your response. How many items in the collection? There are about 100 million documents. How are the caches configured in solrconfig.xml? Each cache has a size attribute of 128. Can you provide a sample of the query? Does it fail immediately after SolrCloud startup or after several hours? It is a query with many terms (more than a thousand) and phrases, where the phrases have many wildcards in them. Once such a query is executed there are many ZooKeeper-related exceptions, and after a couple of such queries it goes OutOfMemory. Thanks, Modassar On Wed, Dec 24, 2014 at 1:37 PM, Dominique Bejean dominique.bej...@eolya.fr wrote: And you didn't say how much RAM is on each server? 2014-12-24 8:17 GMT+01:00 Dominique Bejean dominique.bej...@eolya.fr : Modassar, How many items in the collection? I mean how many documents per collection? 1 million, 10 millions, ...? How are the caches configured in solrconfig.xml? What is the size attribute value for each cache? Can you provide a sample of the query? Does it fail
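A sketch of option 1> above, with made-up field and type names: index bigrams with NGramFilterFactory so that substring matching becomes an exact lookup in the ngram field rather than a wildcard scan. The same analyzer is used at index and query time; for substring semantics the query should be issued as a phrase, so the bigrams must appear in order.

  <fieldType name="text_bigram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>
  </fieldType>
  <field name="body_ngram" type="text_bigram" indexed="true" stored="false"/>
  <copyField source="body" dest="body_ngram"/>

As Erick notes, the term dictionary stays small because there are only ~36^2 possible bigrams over lowercased ASCII letters and digits.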
Re: Loading data to FieldValueCache
Erick, I am trying to do a premature optimization. *There will be no updates to my index. So, no worries about ageing out or garbage collection.* Let me check my understanding: when we talk about the filterCache, it just stores the document IDs in the cache, right? My setup is as follows. There are 16 nodes in my SolrCloud, each having 64GB of RAM, out of which I am allocating 45GB to Solr. I have a collection (say Products, which contains around 100 million docs), which I created with 64 shards, replication factor 2, and 8 shards per node. Each shard gets around 1.6 million documents. So my math for the filterCache for a specific filter is:

- an average filter query will be 20 bytes, so 1000 (distinct states) x 20 bytes = 20 KB
- the union of doc IDs across all values of a given filter equals the total number of doc IDs present in the index. There are 1.6 million documents in a Solr core, so 1,600,000 x 8 bytes (for each doc ID) equals 12.8 MB
- there will be 8 Solr cores per node: 8 x 12.8 MB = *102 MB*

This is the size of the cache for a single filter field on a single node. Considering the heap size I have given, I think this shouldn't be an issue. Thanks, Manohar On Fri, Dec 26, 2014 at 10:56 PM, Erick Erickson erickerick...@gmail.com wrote: Manohar: Please approach this cautiously. You state that you have hundreds of states. Every 100 states will use roughly 1.2G of your filter cache, just for this field. Plus it'll fill up the cache, and entries may soon be aged out anyway. Can you really afford the space? Is it really a problem that needs to be solved at this point? This _really_ sounds like premature optimization to me, as you haven't demonstrated that there's an actual problem you're solving. OTOH, of course, if you're experimenting to better understand all the ins and outs of the process, that's another thing entirely ;) Toke: I don't know the complete algorithm, but if the number of docs that satisfy the fq is small enough, then just the internal Lucene doc IDs are stored rather than a bitset. What exactly "small enough" is I don't know off the top of my head. And I've got to assume looking stuff up in a list is slower than indexing into a bitset, so I suspect "small enough" is very small. On Fri, Dec 26, 2014 at 3:00 AM, Manohar Sripada manohar...@gmail.com wrote: Thanks Toke for the explanation, I will experiment with f.state.facet.method=enum Thanks, Manohar On Fri, Dec 26, 2014 at 4:09 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Manohar Sripada [manohar...@gmail.com] wrote: I have 100 million documents in my index. The maxDoc here is the maximum number of documents in each shard, right? How is it determined that each entry will occupy approximately maxDoc/8 bytes? Assuming that it is random whether a document is part of the result set or not, the most efficient representation is 1 bit/doc (this is often called a bitmap or bitset). So the total number of bits will be maxDoc, which is the same as maxDoc/8 bytes. Of course, result sets are rarely random, so it is possible to have other and more compact representations. I do not know how that plays out in Lucene; hopefully somebody else can help here. If I have to add facet.method=enum every time in the query, how should I specify it for each field separately? f.state.facet.method=enum See https://wiki.apache.org/solr/SimpleFacetParameters#Parameters - Toke Eskildsen
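For reference, the cache being sized here is declared in solrconfig.xml; a typical stanza looks like this (the sizes are illustrative, not a recommendation):

  <!-- each entry costs up to maxDoc/8 bytes (a bitset); small result sets may
       be stored as sorted doc-id lists instead, as Erick describes -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

The pessimistic version of the math above assumes full bitsets rather than 8-byte doc IDs: 1.6M docs / 8 = 200KB per entry per core, so 1000 state filters x 200KB x 8 cores is roughly 1.6GB per node, considerably more than 102MB but still within a 45GB heap.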
Re: How to implement multi-set in a Solr schema.
You can also use group.query or group.func: the former groups documents that match an explicit query, and the latter groups by the unique values of a function query. For the latter you could implement an NLP algorithm as a custom function. -- Jack Krupansky On Sun, Dec 28, 2014 at 5:56 PM, Meraj A. Khan mera...@gmail.com wrote: Thanks Aman. The thing is, the bookName field values are not exactly identical, only nearly identical, so at indexing time I need to figure out which other book names a given value is similar to, using NLP techniques, and then put it in the appropriate bag, so that at retrieval time I retrieve all the elements from that bag if any one of them matches the search query. Thanks. On Dec 28, 2014 1:54 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, You can use grouping in Solr. You can do this either via the query or via solrconfig.xml.

A) via query

  http://localhost:8983?your_query_params&group=true&group.field=bookName

You can limit the size of each group (how many documents you want to show); if you want to show 5 documents per group for this bookName field, specify the parameter group.limit=5.

B) via solrconfig

  <str name="group">true</str>
  <str name="group.field">bookName</str>
  <str name="group.ngroups">true</str>
  <str name="group.truncate">true</str>

With Regards Aman Tandon On Sun, Dec 28, 2014 at 10:29 PM, S.L simpleliving...@gmail.com wrote: Hi All, I have a use case where I need to group documents that have the same value in a field called bookName. If there are multiple documents with the same bookName value and the user searches with a query on bookName, I need to be able to group all the documents with the same bookName together, so that I can display them as a group in the UI. What kind of support does Solr provide for such a scenario, and how should I look at changing my schema.xml, which has bookName as a single-valued text field? Thanks.
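A usage sketch of these two options (endpoint, field names and values are illustrative; bookkey is a hypothetical custom function that would need its own ValueSourceParser plugin implementing the NLP canonicalization Jack mentions):

  # group.query: one group per explicit query
  curl "http://localhost:8983/solr/books/select?q=bookName:potter&group=true&group.query=bookName:%22harry+potter%22&group.query=bookName:%22harry+plotter%22"

  # group.func: one group per unique value of a function query
  curl "http://localhost:8983/solr/books/select?q=*:*&group=true&group.func=bookkey(bookName)"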
Re: solr export get wrong results
Hi Joel, Thanks for your reply. It seems that the weird export results are because I removed the <str name="wt">xsort</str> invariant of the export request handler in the default solrconfig.xml, to get csv-format output. I don't quite understand the meaning of xsort, but I removed it because I always got a json response (as you said) with the xsort invariant in place. Is there a way to get csv output using export? And also, can I get full results from all shards? (I tried to set distrib=true but got SyntaxError: xport RankQuery is required for xsort: rq={!xport}, and I do have rq={!xport} in the export invariants.) 2014-12-27 3:21 GMT+08:00 Joel Bernstein joels...@gmail.com: Hi Sandy, I pulled Solr 4.10.3 to see if I could recreate the issue you are seeing with export, and I wasn't able to reproduce the bug. For example the following query:

  http://localhost:8983/solr/collection1/export?q=join_i:[50 TO 500010]&wt=json&indent=true&sort=join_i+asc&fl=join_i,ShopId_i

brings back the following result:

  {"responseHeader": {"status": 0}, "response":{"numFound":11, "docs":[{"join_i":50,"ShopId_i":578917},{"join_i":51,"ShopId_i":294217},{"join_i":52,"ShopId_i":199805},{"join_i":53,"ShopId_i":633461},{"join_i":54,"ShopId_i":472995},{"join_i":55,"ShopId_i":672122},{"join_i":56,"ShopId_i":394637},{"join_i":57,"ShopId_i":446443},{"join_i":58,"ShopId_i":697329},{"join_i":59,"ShopId_i":166988},{"join_i":500010,"ShopId_i":191261}]}}

Notice the join_i values are all within the correct range. If you can post the export handler configuration we should be able to see the issue. Joel Bernstein Search Engineer at Heliosearch On Fri, Dec 26, 2014 at 1:50 PM, Joel Bernstein joels...@gmail.com wrote: Hi Sandy, The export handler should only return documents in JSON format. The results in your second example are in XML format, so something looks to be wrong in the configuration. Can you post what your solrconfig looks like? Joel Joel Bernstein Search Engineer at Heliosearch On Fri, Dec 26, 2014 at 12:43 PM, Erick Erickson erickerick...@gmail.com wrote: I think you missed a very important part of Jack's reply: bq: I notice that you don't have distrib=false on your select, which would make your select be from all nodes, while export would only return docs from the specific node you sent the request to. And from the Reference Guide on export: bq: The initial release treats all queries as non-distributed requests. So the client is responsible for making the calls to each Solr instance and merging the results. So the export statement you're sending is _only_ exporting the results from the shard on 8983 and completely ignoring the other (6?) shards, whereas the query you're sending is getting the results from all the shards. As Jack said, add distrib=false to the query, send it to the same shard you send the export command to, and the results should match. Also, be sure your configuration for the /select handler doesn't have any additional default parameters that might alter the results, but I doubt that's really a problem here. Best, Erick On Fri, Dec 26, 2014 at 7:02 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Do you have any custom Solr components deployed? Maybe a custom response writer? Ahmet On Friday, December 26, 2014 3:26 PM, Sandy Ding sandy.ding...@gmail.com wrote: Hi Ahmet, I use libuuid for the unique id and I guess there shouldn't be duplicate ids. Also, the results are not just incomplete, they are screwed up.
2014-12-26 20:19 GMT+08:00 Ahmet Arslan iori...@yahoo.com.invalid: Hi, Two different things: If you have a unique key defined, documents with the same id overwrite each other within a single shard. Plus, unique IDs are expected to be unique across shards. Ahmet On Friday, December 26, 2014 11:00 AM, Sandy Ding sandy.ding...@gmail.com wrote: Hi all, I've recently set up a solr cluster and found that export returns different results from select. And I confirmed that the export results are wrong by manually querying the results. Even simple queries like the following get different results:

  curl "http://localhost:8983/solr/pa_info/select?q=*:*&fl=id&sort=id+desc":

  <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">11</int><lst name="params"><str name="sort">id desc</str><str name="fl">id</str><str name="q">*:*</str></lst></lst><result name="response" numFound="1197" start="0"><doc>...</doc></result>

  curl "http://localhost:8983/solr/pa_info/export?q=*:*&fl=id&sort=id+desc":

  {"numFound":172, "docs":[..]

Don't have a clue why this happens! Anyone help? Best, Sandy
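For reference, the stock /export handler in the 4.10-era solrconfig.xml is defined approximately like this (from memory, so verify against your distribution):

  <requestHandler name="/export" class="solr.SearchHandler">
    <lst name="invariants">
      <str name="rq">{!xport}</str>
      <str name="wt">xsort</str>
      <str name="distrib">false</str>
    </lst>
    <arr name="components">
      <str>query</str>
    </arr>
  </requestHandler>

rq={!xport} and wt=xsort work as a pair, which is why removing only the wt invariant produces the errors described above. Note that distrib=false is itself an invariant, which matches Erick's point: each export call returns only the local shard's documents, so a per-shard select with distrib=false is the right comparison.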