eDisMax, multiple language support and stopwords
Hi all, Thanks for the help and advice I've got here so far! Another question - I want to support stopwords at search time, so that e.g. the query "oscar and wilde" is equivalent to "oscar wilde" (this is with lowercaseOperators=false). Fair enough - I have the stopword "and" in the query analyser chain. However, I also need to support French as well as English, so I've got _en and _fr versions of the text fields, with appropriate stemming and stopwords. I index French content into the _fr fields and English into the _en fields. I'm searching with eDisMax over both versions, e.g.:

    <str name="qf">headline_en headline_fr</str>

However, this means I get no results for "oscar and wilde". The parsed query is:

    (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar)) DisjunctionMaxQuery((headline_fr:and)) DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord

If I add "and" to the French stopwords list, I *do* get results, and the parsed query is:

    (+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar)) DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord

This implies that the only solution is to have a minimal, shared stopwords list for all languages I want to support. Is this correct, or is there a way of supporting this kind of searching with per-language stopword lists? Thanks for any ideas! Tom
Re: eDisMax, multiple language support and stopwords
Ah, thanks Markus. I think I'll just add the Boolean operators to the stopwords list in that case. Tom

On 7 November 2013 12:01, Markus Jelsma markus.jel...@openindex.io wrote: This is an ancient problem. The issue here is your mm parameter; it gets confused because for separate fields a different number of tokens is filtered/emitted, so it is never going to work just like this. The easiest option is not to use the stopfilter. http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html https://issues.apache.org/jira/browse/SOLR-3085

-Original message- From: Tom Mortimer tom.m.f...@gmail.com Sent: Thursday 7th November 2013 12:50 To: solr-user@lucene.apache.org Subject: eDisMax, multiple language support and stopwords [...]
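Tom's workaround can be expressed in schema.xml roughly as follows - a sketch assuming the stock 4.x analysis factories, where stopwords_shared.txt is an invented filename containing only the operator words ("and", "or", "not"), so every language's query analyser drops the same tokens and the mm count stays in step across fields:

    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- shared minimal list used by both text_en and text_fr -->
        <filter class="solr.StopFilterFactory" words="stopwords_shared.txt" ignoreCase="true"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
      </analyzer>
    </fieldType>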
Re: newbie getting started with solr
Hi Eric, Solr configuration can certainly be confusing at first. And for some time after. :P If you're running start.jar from the example folder (which is fine for testing, and I've known some people to use it for production systems) then the default solr home is example/solr. This contains solr.xml, which specifies where to find per-core configuration and data. (A core is equivalent to a collection in a simple non-sharded setup.) For now, the easiest thing would be to use the default core in example/solr/collection1. Copy your solrconfig.xml and schema.xml over the ones in collection1/conf (backing up the originals for reference). Create your data directory wherever you like and symlink it into collection1. Now when you run $ java -jar start.jar in example/, you should be able to access Solr at http://localhost:8983/solr/ , and add and search for documents. Hope that helps a bit! Tom

On 7 November 2013 14:50, Palmer, Eric epal...@richmond.edu wrote: Sorry if this is obvious (because it isn't for me). I want to build a solr (4.5.1) + nutch (1.7.1) environment. I'm doing this on amazon linux (I may put nutch on a separate server eventually). Please let me know if my thinking is sound or off base. In the example folder are a lot of files and folders, including the war file and start.jar:

    drwxr-xr-x cloud-scripts
    drwxr-xr-x contexts
    drwxr-xr-x etc
    drwxr-xr-x example-DIH
    drwxr-xr-x exampledocs
    drwxr-xr-x example-schemaless
    drwxr-xr-x lib
    drwxr-xr-x logs
    drwxr-xr-x multicore
    -rw-r--r-- README.txt
    drwxr-xr-x resources
    drwxr-xr-x solr
    drwxr-xr-x solr-webapp
    -rw-r--r-- start.jar
    drwxr-xr-x webapps

I am creating a separate folder for the conf and data folders (on another disk) and placing these files in the conf folder: schema-solr.xml (from nutch) renamed to schema.solr, and solrconfig.xml. I will use the example folder and start.jar from that location. (Is this okay?) Where do I set the collection name? What else do I need to do to get a basic web page indexer built? (I'll work out the crawling later; I just want to be able to manually add some documents and query.) I'm trying to understand solr first and then will use nutch. I have several books and have looked at the tutorial and other web sites. It seems they assume that I know where to begin when creating a new collection and customizing it. Thanks in advance for your help. -- Eric Palmer Web Services U of Richmond To report technical issues, obtain technical support or make requests for enhancements please visit http://web.richmond.edu/contact/technical-support.html
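The steps Tom describes, as a command sketch (paths are assumptions for a stock 4.5.1 download; adjust to your layout). The core name question is answered by example/solr/solr.xml, which in 4.5.1 contains a line like <core name="collection1" instanceDir="collection1" />:

    cd solr-4.5.1/example
    # back up the stock config, then drop in your own
    cp solr/collection1/conf/schema.xml solr/collection1/conf/schema.xml.bak
    cp solr/collection1/conf/solrconfig.xml solr/collection1/conf/solrconfig.xml.bak
    cp /path/to/your/schema.xml solr/collection1/conf/schema.xml
    cp /path/to/your/solrconfig.xml solr/collection1/conf/solrconfig.xml
    java -jar start.jar
    # then browse to http://localhost:8983/solr/ to check the core loaded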
eDisMax and Boolean operator case-sensitivity
Hi, I'm using the eDisMax query parser, and need to support the Boolean operators AND and OR. It seems from testing that these are *not* case sensitive, e.g. with mm set to 0, "oscar AND wilde" returns the same results as "oscar and wilde" (15 hits), while "oscar foo wilde" returns the same results as "oscar wilde" (2000 hits). Is it possible to configure eDisMax to do case-sensitive parsing, so that AND is an operator but "and" is just another term? thanks, Tom
Re: eDisMax and Boolean operator case-sensitivity
Oh, good grief - I was just reading that page, how did I miss that? *derp* Thanks Shawn!!! Tom

On 6 November 2013 18:59, Shawn Heisey s...@elyograg.org wrote: On 11/6/2013 11:46 AM, Tom Mortimer wrote: [...] Include another query parameter: lowercaseOperators=false http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators Thanks, Shawn
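For reference, the fixed request would look something like this (core and field names are illustrative):

    http://localhost:8983/solr/select?defType=edismax&q=oscar+and+wilde&qf=headline&mm=0&lowercaseOperators=false

With lowercaseOperators=false, uppercase AND and OR still act as operators, but lowercase "and" is passed through as a plain term.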
Re: SolrCloud performance in VM environment
Boogie, Shawn, Thanks for the replies. I'm going to try out some of your suggestions today. Although, without more RAM I'm not that optimistic.. Tom

On 21 October 2013 18:40, Shawn Heisey s...@elyograg.org wrote: On 10/21/2013 9:48 AM, Tom Mortimer wrote: Hi everyone, I've been working on an installation recently which uses SolrCloud to index 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2 identical VMs set up for replicas). The reason we're using so many shards for a relatively small index is that there are complex filtering requirements at search time, to restrict users to items they are licensed to view. Initial tests demonstrated that multiple shards would be required. The total size of the index is about 140GB, and each VM has 16GB RAM (32GB total) and 4 CPU units. I know this is far under what would normally be recommended for an index of this size, and I'm working on persuading the customer to increase the RAM (basically, telling them it won't work otherwise). Performance is currently pretty poor and I would expect more RAM to improve things. However, there are a couple of other oddities which concern me.

Running multiple shards like you are, where each operating system is handling more than one shard, is only going to perform better if your query volume is low and you have lots of CPU cores. If your query volume is high or you only have 2-4 CPU cores on each VM, you might be better off with fewer shards or not sharded at all.

The way that I read this is that you've got two physical machines with 32GB RAM, each running two VMs that have 16GB. Each VM houses 4 shards, or 70GB of index. There's a scenario that might be better if all of the following are true:
1) I'm right about how your hardware is provisioned.
2) You or the client owns the hardware.
3) You have an extremely low-end third machine available - a single CPU with 1GB of RAM would probably be enough.
In this scenario, you run one Solr instance and one zookeeper instance on each of your two big machines, and use the third wimpy machine as a third zookeeper node. No virtualization. For the rest of my reply, I'm assuming that you haven't taken this step, but it will probably apply either way.

The first is that I've been reindexing a fixed set of 500 docs to test indexing and commit performance (with soft commits within 60s). The time taken to complete a hard commit after this is longer than I'd expect, and highly variable - from 10s to 70s. This makes me wonder whether the SAN (which provides all the storage for these VMs and the customer's several other VMs) is being saturated periodically. I grabbed some iostat output on different occasions to (possibly) show the variability:

    Device:   tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    sdb     64.50        0.00     2476.00         0      4952
    ...
    sdb      8.90        0.00      348.00         0      6960
    ...
    sdb      1.15        0.00       43.20         0       864

There are two likely possibilities for this. One or both of them might be in play.
1) Because the OS disk cache is small, not much of the index can be cached. This can result in a lot of disk I/O for a commit, slowing things way down. Increasing the size of the OS disk cache is really the only solution for that.
2) Cache autowarming, particularly the filter cache. In the cache statistics, you can see how long each cache took to warm up after the last searcher was opened. The solution for that is to reduce the autowarmCount values.

The other thing that confuses me is that after a Solr restart or hard commit, search times average about 1.2s under light load. After searching the same set of queries for 5-6 iterations this improves to 0.1s. However, in either case - cold or warm - iostat reports no device reads at all:

    Device:   tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    sdb      0.40        0.00        8.00         0       160
    ...
    sdb      0.30        0.00       10.40         0       104

(the writes are due to logging). This implies to me that the 'hot' blocks are being completely cached in RAM - so why the variation in search time and the number of iterations required to speed it up?

Linux is pretty good about making limited OS disk cache resources work. Sounds like the caching is working reasonably well for queries. It might not be working so well for updates or commits, though.

Running multiple Solr JVMs per machine, virtual or not, causes more problems than it solves. Solr has no limits on the number of cores (shard replicas) per instance, assuming there are enough system resources. There should be exactly one Solr JVM per operating system. Running more than one results in quite a lot of overhead, and your memory is precious. When you create a collection, you can give the collections API
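For Shawn's point 2, the autowarmCount knob lives on each cache definition in solrconfig.xml; a sketch (the sizes are illustrative, not recommendations):

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>

Setting autowarmCount="0" trades slower first queries after each commit for faster commits.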
Re: SolrCloud performance in VM environment
Just tried it with no other changes than upping the RAM to 128GB total, and it's flying. I think that proves that RAM is good. =) Will implement suggested changes later, though. cheers, Tom

On 22 October 2013 09:04, Tom Mortimer tom.m.f...@gmail.com wrote: Boogie, Shawn, Thanks for the replies. I'm going to try out some of your suggestions today. Although, without more RAM I'm not that optimistic.. Tom

On 21 October 2013 18:40, Shawn Heisey s...@elyograg.org wrote: [...]
SolrCloud performance in VM environment
Hi everyone, I've been working on an installation recently which uses SolrCloud to index 45M documents into 8 shards on 2 VMs running 64-bit Ubuntu (with another 2 identical VMs set up for replicas). The reason we're using so many shards for a relatively small index is that there are complex filtering requirements at search time, to restrict users to items they are licensed to view. Initial tests demonstrated that multiple shards would be required.

The total size of the index is about 140GB, and each VM has 16GB RAM (32GB total) and 4 CPU units. I know this is far under what would normally be recommended for an index of this size, and I'm working on persuading the customer to increase the RAM (basically, telling them it won't work otherwise). Performance is currently pretty poor and I would expect more RAM to improve things. However, there are a couple of other oddities which concern me.

The first is that I've been reindexing a fixed set of 500 docs to test indexing and commit performance (with soft commits within 60s). The time taken to complete a hard commit after this is longer than I'd expect, and highly variable - from 10s to 70s. This makes me wonder whether the SAN (which provides all the storage for these VMs and the customer's several other VMs) is being saturated periodically. I grabbed some iostat output on different occasions to (possibly) show the variability:

    Device:   tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    sdb     64.50        0.00     2476.00         0      4952
    ...
    sdb      8.90        0.00      348.00         0      6960
    ...
    sdb      1.15        0.00       43.20         0       864

The other thing that confuses me is that after a Solr restart or hard commit, search times average about 1.2s under light load. After searching the same set of queries for 5-6 iterations this improves to 0.1s. However, in either case - cold or warm - iostat reports no device reads at all:

    Device:   tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    sdb      0.40        0.00        8.00         0       160
    ...
    sdb      0.30        0.00       10.40         0       104

(the writes are due to logging). This implies to me that the 'hot' blocks are being completely cached in RAM - so why the variation in search time and the number of iterations required to speed it up?

The Solr caches are only being used lightly by these tests and there are no evictions. GC is not a significant overhead. Each Solr shard runs in a separate JVM with 1GB heap. I don't have a great deal of experience in low-level performance tuning, so please forgive any naivety. Any ideas of what to do next would be greatly appreciated. I don't currently have details of the VM implementation but can get hold of this if it's relevant. thanks, Tom
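For context, the soft/hard commit split Tom describes is configured in solrconfig.xml along these lines (the intervals here are illustrative):

    <autoCommit>
      <maxTime>600000</maxTime>          <!-- hard commit: flushes index to disk -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>60000</maxTime>           <!-- soft commit within 60s: visibility only -->
    </autoSoftCommit>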
Re: Restricting search results by field value
Sounds like it's worth a try! Thanks Andre. Tom

On 5 Dec 2012, at 17:49, Andre Bois-Crettez andre.b...@kelkoo.com wrote: If you do grouping on source_id, it should be enough to request 3 times more documents than you need, then reorder and drop the bottom. Is a 3x overhead acceptable?

On 12/05/2012 12:04 PM, Tom Mortimer wrote: [...]

-- André Bois-Crettez Search technology, Kelkoo http://www.kelkoo.com/
Re: Restricting search results by field value
Thanks, but even with group.main=true the results are not in relevancy (score) order, they are in group order. Which is why I can't use it as is. Tom

On 6 Dec 2012, at 19:00, Way Cool way1.wayc...@gmail.com wrote: Grouping should work: group=true&group.field=source_id&group.limit=3&group.main=true

On Thu, Dec 6, 2012 at 2:35 AM, Tom Mortimer bano...@gmail.com wrote: [...]
Restricting search results by field value
Hi everyone, I've got a problem where I have docs with a source_id field, and there can be many docs from each source. Searches will typically return docs from many sources. I want to restrict the number of docs from each source in results, so there will be no more than (say) 3 docs from source_id=123 etc. Field collapsing is the obvious approach, but I want to get the results back in relevancy order, not grouped by source_id. So it looks like I'll have to fetch more docs than I need to and re-sort them. It might even be better to count source_ids in the client code and drop excess docs that way, but the potential overhead is large. Is there any way of doing this in Solr without hacking in a custom Lucene Collector? (which doesn't look all that straightforward). cheers, Tom
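Putting together Andre's over-fetch idea and Way Cool's grouping parameters from the replies above, the client-side flatten-and-re-sort might look like this in Python (the URL, field names and the 3x factor are all illustrative):

    import requests  # assumed HTTP client; any would do

    SOLR = "http://localhost:8983/solr/select"  # illustrative URL

    def search_capped(q, rows=10, per_source=3):
        """Fetch ~3x the groups we need, then flatten and re-sort by score."""
        params = {
            "q": q,
            "fl": "id,source_id,score",
            "group": "true",
            "group.field": "source_id",
            "group.limit": per_source,   # max docs kept per source
            "rows": rows * 3,            # with group=true, rows counts groups
            "wt": "json",
        }
        resp = requests.get(SOLR, params=params).json()
        docs = []
        for group in resp["grouped"]["source_id"]["groups"]:
            docs.extend(group["doclist"]["docs"])
        docs.sort(key=lambda d: d["score"], reverse=True)  # pure relevancy order
        return docs[:rows]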
Re: AutoIndexing
Hi Darshan, Can you give us some more details, e.g. what do you mean by database? An RDBMS? Which software? How are you indexing it (or intending to index it) into Solr? etc... cheers, Tom

On 25 Sep 2012, at 09:55, darshan dk...@dreamsoftech.com wrote: Hi All, Is there any way where I can auto-index whenever there are changes in my database? Thanks, Darshan
Re: How can I create about 100000 independent indexes in Solr?
Hi, Why do you think that the indexes should be independent? What would be the problem with using a single index and filter queries? Tom

On 25 Sep 2012, at 03:21, 韦震宇 weizhe...@win-trust.com wrote: Dear all, The company I'm working in has a website serving more than 100000 customers, and every customer should have its own search category, so I should create an independent index for every customer. The page http://wiki.apache.org/solr/MultipleIndexes gives some solutions for creating multiple indexes. I want to use the multicore solution, but I'm afraid that Solr can't support so many indexes this way. The other solution, "Flattening data into a single index", is a choice, but I think it's best to keep all indexes independent. Could you tell me how to create about 100000 independent indexes in Solr? Thank you all for reply!
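To make the single-index suggestion concrete: give every document a customer field and filter on it at query time (the names here are illustrative):

    <field name="customer_id" type="string" indexed="true" stored="true"/>

    http://localhost:8983/solr/select?q=some+query&fq=customer_id:cust00042

Since filter queries are cached independently of the main query, the per-customer restriction is cheap once warm.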
Re: AutoIndexing
I'm afraid I don't have any DIH experience myself, but some googling suggests that using a postgresql trigger to start a delta import might be one approach: http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command and http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport Tom

On 25 Sep 2012, at 11:28, darshan dk...@dreamsoftech.com wrote: My document is a database (yes, an RDBMS) and the software for it is postgresql, where any change in its tables should be reflected without re-indexing. I am indexing it via the DIH process. Thanks, Darshan

-Original Message- From: Tom Mortimer [mailto:tom.m.f...@gmail.com] Sent: Tuesday, September 25, 2012 3:31 PM To: solr-user@lucene.apache.org Subject: Re: AutoIndexing [...]
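For reference, a delta-import entity along the lines the wiki describes would sit in the DIH config; the table and column names here are invented for illustration:

    <entity name="item" pk="id"
            query="SELECT id, title FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title FROM item WHERE id='${dih.delta.id}'"/>

A postgresql trigger (or a cron job) would then hit http://localhost:8983/solr/dataimport?command=delta-import to pick up the changed rows.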
Re: ID reference field - Needed but not searchable or retrievable
Hi James, If you don't want this field to be included in user searches, just omit it from the search configuration (e.g. if using the eDisMax parser, don't put it in the qf list). To keep it out of search results, exclude it from the fl list. See http://wiki.apache.org/solr/CommonQueryParameters and http://wiki.apache.org/solr/ExtendedDisMax The overhead from storing it will most likely be very small, and as Michael points out it means you could potentially reference documents both ways. Not sure about the JSON question, but in XML

    <delete><query>uniqueID:7</query></delete>

would remove the whole doc, not just the uniqueID field. Tom

On 20 Sep 2012, at 13:38, Spadez james_will...@hotmail.com wrote: Hi. My SQL database assigns a uniqueID to each item. I want to keep this uniqueID associated with the items that are in Solr even though I won't ever need to display them or have them searchable. I do however want to be able to target specific items in Solr with it, for updating or deleting the record. Right now I have this in my schema:

    <field name="id" type="string" indexed="true" stored="true"/>

However, since I don't want it searchable or stored it should be this:

    <field name="uniqueid" type="string" indexed="false" stored="false"/>

Firstly, is this the correct way of doing this? I saw mention of an "ignored" attribute that can be added. Secondly, if I wanted to do updates to the fields using JSON by targeting the uniqueID, can I still do something like this:

    "delete": { "uniqueid": 7 },  /* delete entry uniqueID=7 */

Thank you for any help you can give. Hope I explained it well enough. James
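For the JSON side, the Solr 4 update handler does accept delete-by-query; something like this should be the equivalent of the XML above (URL assumed):

    curl "http://localhost:8983/solr/update?commit=true" \
         -H "Content-Type: application/json" \
         -d '{"delete": {"query": "uniqueid:7"}}'

Note that a delete-by-query can only match the document if uniqueid is indexed="true" - another reason to leave the field indexed and simply omit it from qf and fl as suggested above.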
Solr 4.0 - disappointing results sharding on 1 machine
Hi all, After reading http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/ , I thought I'd do my own experiments. I used 2M docs from wikipedia, indexed in Solr 4.0 Beta on a standard EC2 large instance. I compared an unsharded and 2-shard configuration (the latter set up with SolrCloud following the http://wiki.apache.org/solr/SolrCloud example). I wrote a simple python script to randomly throw queries from a hand-compiled list at Solr. The only extra I had turned on was facets (on document category). To my surprise, the performance of the 2-shard configuration is almost exactly half that of the unsharded index -

unsharded
    4983912891 results in 24920 searches; 0 errors
    70.02 mean qps
    0.35s mean query time, 2.25s max, 0.00s min
    90% of qtimes <= 0.83s
    99% of qtimes <= 1.42s
    99.9% of qtimes <= 1.68s

2-shard
    4990351660 results in 24501 searches; 0 errors
    34.07 mean qps
    0.66s mean query time, 694.20s max, 0.01s min
    90% of qtimes <= 1.19s
    99% of qtimes <= 2.12s
    99.9% of qtimes <= 2.95s

All caches were set to 4096 items, and performance looks ok in both cases (hit ratios close to 1.0, 0 evictions). I gave the single VM -Xmx1G and each shard VM -Xmx500M. I must be doing something stupid - surely this result is unexpected? Does anybody have any thoughts where it might be going wrong? cheers, Tom
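Tom's benchmark script isn't shown; a minimal sketch of the approach he describes - random queries from a hand-compiled list, with percentile reporting - might look like this (the URL, facet field and file name are assumptions):

    import random
    import time
    import requests  # assumed HTTP client

    QUERIES = [line.strip() for line in open("queries.txt")]  # hand-compiled list

    def benchmark(n=1000):
        times = []
        for _ in range(n):
            t0 = time.time()
            requests.get("http://localhost:8983/solr/select",
                         params={"q": random.choice(QUERIES),
                                 "facet": "true", "facet.field": "category",
                                 "wt": "json"})
            times.append(time.time() - t0)
        times.sort()
        print("%.2f mean qps" % (len(times) / sum(times)))
        print("%.2fs mean query time, %.2fs max, %.2fs min"
              % (sum(times) / len(times), times[-1], times[0]))
        for pct in (0.9, 0.99, 0.999):
            print("%s%% of qtimes <= %.2fs"
                  % (pct * 100, times[int(pct * len(times)) - 1]))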
Re: Solr 4.0 - disappointing results sharding on 1 machine
Before anyone asks, these results were obtained warm.

On 20 Sep 2012, at 14:39, Tom Mortimer tom.m.f...@gmail.com wrote: [...]
Re: Personalized Boosting
I'm still not sure I understand what it is you're trying to do. Index-time or query-time boosts would probably be neater and more predictable than multiple field instances, though. http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22field.22 http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29 Tom

On 19 Sep 2012, at 02:49, deniz denizdurmu...@gmail.com wrote: Hello Tom, Thank you for your link, but after reviewing it, I don't think it will help... In my case it will be dynamic rather than set in a config file, and if you think of a big country like Russia or China, I would need to add all cities manually to the elevate.xml file, and also the boosted users, which is not something I desire... it seems like duplicating the values in the location field is the best (at least quickest) solution for this case.. - Zeki ama calismiyor... Calissa yapar...
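The two linked pages describe, roughly, an index-time boost in the update XML and a query-time boost function; hedged sketches of each, with invented field names:

    <add>
      <doc>
        <field name="id">userA</field>
        <!-- index-time field boost: this doc scores higher on location matches -->
        <field name="location" boost="10.0">Moscow</field>
      </doc>
    </add>

or, at query time with eDisMax, multiply the score by a per-document factor field:

    q=location:Moscow&defType=edismax&boost=boost_factor

where boost_factor would be a numeric field set to, say, 10 for users who have clicked "Boost Me!" and 1 otherwise.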
Re: Solr4 how to make it do this?
Hi George, I don't think this will work. The synonyms will be added after the query is parsed, so you'll have terms like "bed:3" rather than matching 3 against the bed field. If I was implementing this I'd try doing some pattern matching before passing the query to Solr, e.g.: "3 bed Surrey" -> q=Surrey&fq=bed:3 I guess this kind of thing could also be implemented as a Solr query plug-in. Don't know if anything like it exists. Tom

On 18 Sep 2012, at 11:30, george123 daniel.tarase...@gmail.com wrote: I guess I could come up with a synonyms.txt file and change every instance of "3 bed" to "bed:3", and it should work. e.g. 3 bed => bed:3 Not exactly a synonym or what it was designed for, but it might work?
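A sketch of the pattern-matching approach Tom suggests, in Python (the pattern and field name are invented for illustration):

    import re

    BED_PATTERN = re.compile(r"\b(\d+)\s+bed(room)?s?\b", re.IGNORECASE)

    def preprocess(raw_query):
        """Split '3 bed Surrey' into a text query plus structured filters."""
        filters = []
        match = BED_PATTERN.search(raw_query)
        if match:
            filters.append("bed:%s" % match.group(1))
            raw_query = BED_PATTERN.sub("", raw_query).strip()
        return {"q": raw_query or "*:*", "fq": filters}

    print(preprocess("3 bed Surrey"))
    # -> {'q': 'Surrey', 'fq': ['bed:3']}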
Re: Personalized Boosting
Hi, Would this do the job? http://wiki.apache.org/solr/QueryElevationComponent Tom

On 18 Sep 2012, at 01:36, deniz denizdurmu...@gmail.com wrote: Hello All, I have a requirement (or a pre-requirement) for our search application. Basically the engine will be on a website with plenty of users and more than 20 different fields, including location. So basically, the question is this: Is it possible to let users define their position in search results when location is queried? Let's say that I am UserA and when you make a search with Moscow, my default ranking is 258. By clicking a button, something like "Boost Me!", I would like to see UserA as the first user when a search is done with the Moscow query. Is this possible? I have some ideas (like adding the person's location 10 times to their location field, so it will score the highest, and so on) but I am not sure whether the requirement is hard or easy to implement, or whether it will require a plugin rather than config changes... anyone has any ideas? - Zeki ama calismiyor... Calissa yapar...
Error loading a custom request handler in Solr 4.0
Hi, Apologies if this is really basic. I'm trying to learn how to create a custom request handler, so I wrote the minimal class (attached), compiled and jar'd it, and placed it in example/lib. I added this to solrconfig.xml:

    <requestHandler name="/flaxtest" class="FlaxTestHandler" />

When I started Solr with java -jar start.jar, I got this:

    ... SEVERE: java.lang.NoClassDefFoundError: org/apache/solr/handler/RequestHandlerBase
    at java.lang.ClassLoader.defineClass1(Native Method) ...

So I copied all the dist/*.jar files into lib and tried again. This time it seemed to start ok, but browsing to http://localhost:8983/solr/ displayed this:

    org.apache.solr.common.SolrException: Error Instantiating Request Handler, FlaxTestHandler is not a org.apache.solr.request.SolrRequestHandler
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:410) ...

Any ideas? thanks, Tom
Re: Error loading a custom request handler in Solr 4.0
Sure -

    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.handler.RequestHandlerBase;

    public class FlaxTestHandler extends RequestHandlerBase {

        public FlaxTestHandler() { }

        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
                throws Exception {
            rsp.add("FlaxTest", "Hello!");
        }

        public String getDescription() { return "Flax"; }
        public String getSourceId() { return "Flax"; }
        public String getSource() { return "Flax"; }
        public String getVersion() { return "Flax"; }
    }

On 10 August 2011 16:43, simon mtnes...@gmail.com wrote: The attachment isn't showing up (in gmail, at least). Can you inline the relevant bits of code? On Wed, Aug 10, 2011 at 11:05 AM, Tom Mortimer t...@flax.co.uk wrote: [...]
Re: how to ignore case in solr search field?
You can use solr.LowerCaseFilterFactory in an analyser chain for both indexing and queries. The schema.xml supplied with the example has several field types using this (including text_general). Tom

On 10 August 2011 16:42, nagarjuna nagarjuna.avul...@gmail.com wrote: Hi, please help me.. how to ignore case while searching in solr? e.g. I need the same results for the keywords abc, ABC, aBc, AbC and all the cases. Thank u in advance
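For illustration, a case-insensitive type along the lines of the example schema's text_general, simplified here to just the relevant filters:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Because both sides lowercase, "abc", "ABC" and "aBc" all normalise to the same indexed term.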
Re: Error loading a custom request handler in Solr 4.0
Interesting.. is this in trunk (4.0)? Maybe I've broken mine somehow! What classpath did you use for compiling? And did you copy anything other than the new jar into lib/ ? thanks, Tom

On 10 August 2011 18:07, simon mtnes...@gmail.com wrote: It's working for me. Compiled, inserted in solr/lib, added the config line to solrconfig. When I send a /flaxtest request I get

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">16</int>
      </lst>
      <str name="FlaxTest">Hello!</str>
    </response>

I was doing this within a core defined in solr.xml -Simon

On Wed, Aug 10, 2011 at 11:46 AM, Tom Mortimer t...@flax.co.uk wrote: [...]
Re: Error loading a custom request handler in Solr 4.0
Thanks Simon. I'll try again tomorrow. Tom

On 10 August 2011 18:46, simon mtnes...@gmail.com wrote: This is in trunk (up to date). Compiler is 1.6.0_26, classpath was dist/apache-solr-solrj-4.0-SNAPSHOT.jar:dist/apache-solr-core-4.0-SNAPSHOT.jar, built from trunk just prior by 'ant dist'. I'd try again with a clean trunk. -Simon

On Wed, Aug 10, 2011 at 1:20 PM, Tom Mortimer t...@flax.co.uk wrote: [...]
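For anyone reproducing this, the build steps implied by the thread would be roughly as follows (paths are assumptions; depending on the trunk version you may also need the Lucene jars on the compile classpath):

    # from the checked-out trunk, after 'ant dist'
    javac -cp dist/apache-solr-solrj-4.0-SNAPSHOT.jar:dist/apache-solr-core-4.0-SNAPSHOT.jar \
        FlaxTestHandler.java
    jar cf flaxtest.jar FlaxTestHandler.class
    cp flaxtest.jar example/solr/lib/    # then restart Solr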
Highlighting not working
Hi, I'm having trouble getting highlighting to work for a large text field. This field can be in several languages, so I'm sending it to one of several fields configured appropriately (e.g. cv_text_en) and then copying it to a common field for storage and display (cv_text). The relevant fragment of schema.xml looks like:

    <field name="cv_text_en" type="text_en" indexed="true" stored="false" termVectors="true" termPositions="true"/>
    ...
    <copyField source="cv_text_*" dest="cv_text" />
    <field name="cv_text" type="text" indexed="false" stored="true" />

At search time I can't get cv_text to be highlighted - it's returned in its entirety. Here's the relevant bit of solrconfig.xml (I'm using qt=all with the default request handler):

    <requestHandler name="all" class="solr.SearchHandler" default="false">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
        <str name="df">cv_text_en</str>
        <str name="df">cv_text_de</str>
        ...
        <str name="hl">on</str>
        <str name="hl.fl">cv_text</str>
      </lst>

I've tried playing with other hl. parameters, but have had no luck so far. Any ideas? thanks, Tom
Re: Highlighting not working
I guess what I'm asking is - can Solr highlight non-indexed fields? Tom

On 7 April 2011 11:33, Tom Mortimer t...@flax.co.uk wrote: [...]
Re: Highlighting not working
Problem solved. *bangs head on desk* T

On 7 April 2011 11:33, Tom Mortimer t...@flax.co.uk wrote: [...]
copyField at search time / multi-language support
Hi, Here's my problem: I'm indexing a corpus with text in a variety of languages. I'm planning to detect these at index time and send the text to a suitably-configured field (e.g. mytext_de for German, mytext_cjk for Chinese/Japanese/Korean etc.) At search time I want to search all of these fields. However, there will be at least 12 of them, which could lead to a very long query string. (Also I need to use the standard query parser rather than dismax, for full query syntax.) Therefore I was wondering if there was a way to copy fields at search time, so I can have my mytext query in a single field and have it copied to mytext_de, mytext_cjk etc. Something like:

    <copyQueryField source="mytext" dest="mytext_de" />
    <copyQueryField source="mytext" dest="mytext_cjk" />
    ...

If this is not currently possible, could someone give me some pointers for hacking Solr to support it? Should I subclass solr.SearchHandler? I know nothing about Solr internals at the moment... thanks, Tom
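Solr has no search-time equivalent of copyField, so the usual workaround is to expand the query across the fields client-side before sending it; a sketch in Python (field names from the post, everything else illustrative - note it does not escape embedded query syntax):

    LANG_FIELDS = ["mytext_en", "mytext_de", "mytext_cjk"]  # ... up to 12

    def expand(user_query):
        """Turn a single-field query into an OR across all language fields."""
        return " OR ".join("%s:(%s)" % (f, user_query) for f in LANG_FIELDS)

    print(expand("oscar wilde"))
    # mytext_en:(oscar wilde) OR mytext_de:(oscar wilde) OR mytext_cjk:(oscar wilde)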