Re: Problem with numeric math types and the dataimport handler
Sounds similar to https://issues.apache.org/jira/browse/SOLR-6165 which I fixed in 4.10. Can you try a newer release?

On Wed, May 20, 2015 at 6:51 AM, Shawn Heisey apa...@elyograg.org wrote:

An unusual problem is happening with the DIH on a field that is an unsigned BIGINT in the MySQL database. This is Solr 4.9.1 without SolrCloud, running on OpenJDK 7u79. During the actual import, everything is fine. The problem comes when I restart Solr and the transaction logs are replayed. I get the following exception for every document replayed:

WARN - 2015-05-19 18:52:44.461; org.apache.solr.update.UpdateLog$LogReplayer; REYPLAY_ERR: IOException reading log org.apache.solr.common.SolrException: ERROR: [doc=getty26025060] Error adding field 'file_size'='java.math.BigInteger:5934053' msg=For input string: java.math.BigInteger:5934053

I believe I need one of two things to solve this problem:

1) A connection parameter for the MySQL JDBC driver that will force the use of java.lang.* objects and exclude the java.math.* classes.
2) Writing the actual imported value into the transaction log rather than including the class name in the string representation.

Testing shows that the toString() method on BigInteger does *NOT* include the class name, so I am confused about why the class name is being recorded in the transaction log. For the first solution, I've been looking for a MySQL connection parameter to change the Java object types that get used, but so far I haven't found one. For the second, I should probably open an issue in Jira, but I wanted to run it by everyone before taking that step.

I have another index (built from a different database) where this isn't happening, because the MySQL column is *NOT* unsigned, which causes the JDBC driver to use java.lang.Long instead of java.math.BigInteger.

Thanks, Shawn

-- 
Regards, Shalin Shekhar Mangar.
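Until a release with the SOLR-6165 fix is an option, one possible workaround (a sketch, not something from the thread; the class name and package are made up) is a small DIH transformer that converts BigInteger values to Long before they reach the update chain, attached to the entity via transformer="com.example.BigIntegerToLongTransformer":

package com.example;

import java.math.BigInteger;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

/** Replaces BigInteger values with Long so the tlog never sees the class name. */
public class BigIntegerToLongTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    for (Map.Entry<String, Object> entry : row.entrySet()) {
      if (entry.getValue() instanceof BigInteger) {
        // longValue() truncates values above 2^63-1, which is fine for file sizes
        entry.setValue(((BigInteger) entry.getValue()).longValue());
      }
    }
    return row;
  }
}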
Re: Deduplication
Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing.

This sounded like the perfect option ... until I read Jack's comment:

My understanding was that the distributed update processor is near the end of the chain, so that the running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from the leader to the replicas of a shard?

That would pose some potential problems. Would a custom update processor make the solution cloud-safe?

Thx, - Bram
Re: Looking up arrays in a sub-entity
Personally, I see this as a limit of the dataimporthandler. It gets you started, but when your needs get at all complicated, it can't help you. I would encourage you to write your own indexing code. A little bit of code that reads over your database, sorts it out in the right way, and pushes it to Solr over HTTP POST will let you achieve what you are aiming for.

Upayavira

On Tue, May 19, 2015, at 08:51 PM, rumford wrote:

I have an entity which extracts records from a MySQL data source. One of the fields is meant to be a multi-value field, except this data source does not store the values. Rather, it stores their ids in a single column as a pipe-delimited string. The values themselves are in a separate table, in an entirely different database, on a different server. I have written a transformer to make an array out of this delimited string, but after that I'm at a loss. Can I iterate over an array in a sub-entity? I need to query that second data source for each of the IDs that I find in each record of the first data source. Other people who have asked similar questions have been able to solve their issue with a join, but in my case I cannot.
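To make the suggestion concrete, here is a rough SolrJ 5.x sketch (all JDBC URLs, table and column names are invented for illustration) that reads the master rows, resolves the pipe-delimited ids against the second database, and posts the assembled documents:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CustomIndexer {
  public static void main(String[] args) throws Exception {
    try (Connection master = DriverManager.getConnection("jdbc:mysql://host1/db1");
         Connection lookup = DriverManager.getConnection("jdbc:mysql://host2/db2");
         HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      PreparedStatement values =
          lookup.prepareStatement("SELECT value FROM value_table WHERE id = ?");
      ResultSet rows = master.createStatement()
          .executeQuery("SELECT id, value_ids FROM master_table");
      while (rows.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rows.getString("id"));
        // resolve each id from the pipe-delimited column against the second DB
        for (String valueId : rows.getString("value_ids").split("\\|")) {
          values.setString(1, valueId);
          try (ResultSet v = values.executeQuery()) {
            if (v.next()) doc.addField("values", v.getString("value")); // multi-valued
          }
        }
        solr.add(doc);
      }
      solr.commit();
    }
  }
}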
Re: Solr query which return only those docs whose all tokens are from given list
Requesting Solr experts again to suggest some solutions to my above problem, as I am not able to solve this.

On Tue, May 12, 2015 at 11:04 AM, Naresh Yadav nyadav@gmail.com wrote:

Thanks Andrew, you got my problem precisely, but the solutions you suggested may not work for me. In my API I get only the list of authorized tags, i.e. [T1, T2, T3], and based on that only I need to construct my Solr query. So the first solution with NOT (T4 OR T5) will not work. In the real case the tag ids T1, T2 are UUIDs, so the range query also will not work, as I have no control over the order of these ids. Looking for more suggestions?

Thanks, Naresh

On Mon, May 11, 2015 at 10:05 PM, Andrew Chillrud achill...@opentext.com wrote:

Based on his example, it sounds like Naresh not only wants the tags field to contain at least one of the values [T1, T2, T3] but also wants to exclude documents that contain a tag other than T1, T2, or T3 (Doc3 should not be retrieved). If the set of possible values in the tags field is limited and known, you could use a NOT (or '-') clause to accomplish this. If there were 5 possible tag values:

tags:((T1 OR T2 OR T3) NOT (T4 OR T5))

However, this doesn't seem practical if the number of possible values is large or unlimited. Perhaps something could be done with range queries:

tags:((T1 OR T2 OR T3) NOT ([* TO T1} OR {T1 TO T2} OR {T3 TO *]))

however this would require whatever is constructing the query to be aware of the lexical ordering of the terms in the index. Maybe there are more elegant solutions, but I am not aware of them.

- Andy -

-----Original Message-----
From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal
Sent: Monday, May 11, 2015 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr query which return only those docs whose all tokens are from given list

Hi Naresh, couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), i.e.: tags:T1 tags:T2 tags:T3

-sujit

On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote:

Hi all, also asked this here: http://stackoverflow.com/questions/30166116

For example, I have Solr docs in which a tags field is indexed:

Doc1 - tags:T1 T2
Doc2 - tags:T1 T3
Doc3 - tags:T1 T4
Doc4 - tags:T1 T2 T3

Query1: get all docs with tags:T1 AND tags:T3. This works and will give Doc2 and Doc4.
Query2: get all docs whose tags must all be among [T1, T2, T3]. Expected: Doc1, Doc2, Doc4.

How do I model Query2 in Solr? Please help me with this.
Re: Problem with numeric math types and the dataimport handler
On 5/20/2015 12:06 AM, Shalin Shekhar Mangar wrote:

Sounds similar to https://issues.apache.org/jira/browse/SOLR-6165 which I fixed in 4.10. Can you try a newer release?

I can't upgrade yet. I am using a plugin that hasn't been verified against anything newer than 4.9. When a new version becomes available, I will begin testing 5.x. The patch does look like it will fix the issue perfectly ... so I am very likely to patch 4.9.1 and build a custom war.

Thanks, Shawn
Re: Deduplication
On 19/05/15 14:47, Alessandro Benedetti wrote:

Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which seemed to be the point of the very first part of the mail), here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after; I essentially want a unique constraint on an arbitrary field. Without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.

- Bram
[solr 5.1] Looking for full text + collation search field
Hello, might anyone suggest a field type with which I can do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation?

An example of what I want to do: there is a field composer to which I passed the value "Dvořák, Antonín". I want the following queries to match:

composer:(antonín dvořák)
composer:dvorak
composer:dvorak, antonin

The latter case is possible using a solr.ICUCollationField, but that type does not support an analyzer and consequently no tokenizer; thus it is not helpful. Unlike former versions of Solr, there do not seem to be CollationKeyFilters which you may hang into the analyzer of a solr.TextField ... so I am a bit at a loss as to how I get *both* a tokenizer and a collation at the same time.

Thanks for help, Björn
Re: Deduplication
What the Solr de-duplication offers you is to calculate a hash for each input document (based on a set of fields). You can then select two options:

- index everything; documents with the same signature will be equal
- avoid the overwriting of duplicates.

How the similarity hash is calculated is something you can play with and customise if needed. Having clarified that, do you think it can fit in some way, or are you definitely not talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam bram.van...@intix.eu:

On 19/05/15 14:47, Alessandro Benedetti wrote:

Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which seemed to be the point of the very first part of the mail), here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after; I essentially want a unique constraint on an arbitrary field. Without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.

- Bram

-- 
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience - 1794 England
Re: Term Frequency Calculation - Clarification
Thanks Jack. In my case there is only one document - "Foo Foo is in bar". As per your comment, I should expect TF to be 2, but I am getting one. Is there a check where, if one match is a subset of the other, it is counted once? My class extends DefaultSimilarity.

Cheers
Ariya Bala S

On Wed, May 20, 2015 at 2:09 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

-- Jack Krupansky

On Wed, May 20, 2015 at 6:13 AM, ariya bala ariya...@gmail.com wrote:

Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Term Frequency Calculation - Clarification
Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Re: Term Frequency Calculation - Clarification
Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

-- Jack Krupansky

On Wed, May 20, 2015 at 6:13 AM, ariya bala ariya...@gmail.com wrote:

Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Problem using a function with a multivalued field
Hi everyone, I've been reading answers around this problem but I wanted to make sure that there is another way out of my problem. The thing is that the solution shouldn't be at index time, involve indexing a new field, or involve changing this multi-valued field to a single-valued one.

Problem: I need to run a custom function with some fields, but I see that it's not possible to get the value (the first value in this case) of a multivalued field. "title" is a multi-valued field. See: if(exists(title),strdist(title,"string1"),0). This throws the "can't use FieldCache on a multivalued field" error.

Solutions that don't work for me:

- Keep a copy of the value in a non-multi-valued field, using an update processor: this involves indexing a new field.
- Change the field to multiValued=false: this involves using a single-valued field. I will be indexing new data in the future and I need some fields to be multi-valued, but I also need to work with them.

Thanks in advance, I spent a lot of time on this without a solution. I'm using Solr 4.10.
When is too many fields in qf is too many?
Hi everyone,

My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters.

Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf.

I have considered creating a pseudo-field for each group and then using copyField into that group field. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group change dynamically (at least once a month), and when they do change, I have to re-index everything (this I have to avoid) to sync that group field.

I'm using qf with edismax and my Solr version is 5.1.

Thanks, Steve
Error on grouping result set
Hi, I am having some problems while grouping the result set. I have a Solr schema like this:

<fields>
  <field name="id" type="string" indexed="false" stored="true" required="true" />
  <field name="product" type="string" indexed="true" stored="true" required="true" />
  <field name="vendor" type="string" indexed="true" stored="true" required="true" />
  <field name="language" type="string" indexed="true" stored="true" required="true" />
  <field name="TotalInvoices" type="float" indexed="true" stored="true" required="true"/>
</fields>

I am querying the schema and the result is like this:

product,Vendor,Invoice
abc,vendor1,49206.758
abc,vendor2,35654.981
abc,vendor2,94861.258
abc,vendor3,990.96012
abc,vendor3,990.96012
abc,vendor3,990.9601

I want to group the result by the vendor field, so I post a query like this:

http://localhost:8983/solr/gettingstarted_shard2_replica2/select?q=abc&fl=product%2Cvendor%2CTotalInvoices&wt=json&indent=true&debugQuery=true&group=true&group.field=vendor

I am getting an error for this in the debug field:

error:{ msg:org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2];, trace:org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2]\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:342)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)\n\tat org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:745)\nCaused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this
Re: Term Frequency Calculation - Clarification
Please ignore.

On Wed, May 20, 2015 at 2:45 PM, ariya bala ariya...@gmail.com wrote:

Thanks Jack. In my case there is only one document - "Foo Foo is in bar". As per your comment, I should expect TF to be 2, but I am getting one. Is there a check where, if one match is a subset of the other, it is counted once? My class extends DefaultSimilarity.

Cheers
Ariya Bala S

On Wed, May 20, 2015 at 2:09 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

-- Jack Krupansky

On Wed, May 20, 2015 at 6:13 AM, ariya bala ariya...@gmail.com wrote:

Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Re: Solr query which return only those docs whose all tokens are from given list
Use an update processor to add the number of tags per doc, e.g. check CountFieldValuesUpdateProcessorFactory:

Doc1 - tags:T1 T2 ; tagNum: 2
Doc2 - tags:T1 T3 ; tagNum: 2
Doc3 - tags:T1 T4 ; tagNum: 2
Doc4 - tags:T1 T2 T3 ; tagNum: 3

Then when you search for tags you need to get the number of tags matched per document. It can be done with the recently implemented constant-score operator ^=, e.g. tags:(T1^=1 T2^=1 T3^=1). Then we subtract the expected number of tags per doc:

q=sub(query($tagsAct),tagNum)&tagsAct=tags:(T1^=1 T2^=1 T3^=1)

and then cut off the documents with not enough coverage:

q={!frange l=0}sub(query($tagsAct),tagNum)&tagsAct=tags:(T1^=1 T2^=1 T3^=1)

On Wed, May 20, 2015 at 10:10 AM, Naresh Yadav nyadav@gmail.com wrote:

Requesting Solr experts again to suggest some solutions to my above problem, as I am not able to solve this.

On Tue, May 12, 2015 at 11:04 AM, Naresh Yadav nyadav@gmail.com wrote:

Thanks Andrew, you got my problem precisely, but the solutions you suggested may not work for me. In my API I get only the list of authorized tags, i.e. [T1, T2, T3], and based on that only I need to construct my Solr query. So the first solution with NOT (T4 OR T5) will not work. In the real case the tag ids T1, T2 are UUIDs, so the range query also will not work, as I have no control over the order of these ids. Looking for more suggestions?

Thanks, Naresh

On Mon, May 11, 2015 at 10:05 PM, Andrew Chillrud achill...@opentext.com wrote:

Based on his example, it sounds like Naresh not only wants the tags field to contain at least one of the values [T1, T2, T3] but also wants to exclude documents that contain a tag other than T1, T2, or T3 (Doc3 should not be retrieved). If the set of possible values in the tags field is limited and known, you could use a NOT (or '-') clause to accomplish this. If there were 5 possible tag values:

tags:((T1 OR T2 OR T3) NOT (T4 OR T5))

However, this doesn't seem practical if the number of possible values is large or unlimited. Perhaps something could be done with range queries:

tags:((T1 OR T2 OR T3) NOT ([* TO T1} OR {T1 TO T2} OR {T3 TO *]))

however this would require whatever is constructing the query to be aware of the lexical ordering of the terms in the index. Maybe there are more elegant solutions, but I am not aware of them.

- Andy -

-----Original Message-----
From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal
Sent: Monday, May 11, 2015 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr query which return only those docs whose all tokens are from given list

Hi Naresh, couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), i.e.: tags:T1 tags:T2 tags:T3

-sujit

On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote:

Hi all, also asked this here: http://stackoverflow.com/questions/30166116

For example, I have Solr docs in which a tags field is indexed:

Doc1 - tags:T1 T2
Doc2 - tags:T1 T3
Doc3 - tags:T1 T4
Doc4 - tags:T1 T2 T3

Query1: get all docs with tags:T1 AND tags:T3. This works and will give Doc2 and Doc4.
Query2: get all docs whose tags must all be among [T1, T2, T3]. Expected: Doc1, Doc2, Doc4.

How do I model Query2 in Solr? Please help me with this.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
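For reference, the final query can also be issued from SolrJ; a minimal sketch (core name and tag values are placeholders, and it assumes tagNum was populated at index time as described above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CoveredTagsQuery {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      // keep docs where (matched tags - total tags) >= 0, i.e. every tag matched
      SolrQuery q = new SolrQuery("{!frange l=0}sub(query($tagsAct),tagNum)");
      // ^=1 gives each clause a constant score of 1, so query($tagsAct)
      // evaluates to the number of authorized tags the document matched
      q.set("tagsAct", "tags:(T1^=1 T2^=1 T3^=1)");
      QueryResponse rsp = solr.query(q);
      System.out.println(rsp.getResults().getNumFound());
    }
  }
}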
Re: Deduplication
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote:

Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing.

This sounded like the perfect option ... until I read Jack's comment:

My understanding was that the distributed update processor is near the end of the chain, so that the running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from the leader to the replicas of a shard?

That would pose some potential problems. Would a custom update processor make the solution cloud-safe?

Starting with Solr 5.1, you have the ability to specify an update processor on the fly for a request, and you can even control whether it is to be executed before any distribution happens or before it is actually indexed on the replica. E.g. you can specify processor=xyz,MyCustomUpdateProc in the request to have processor xyz run first, then MyCustomUpdateProc, and then the default update processor chain (which will also distribute the doc to the leader or from the leader to a replica). This also means that such processors will not be executed on the replicas at all. You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and MyCustomUpdateProc run on each replica (including the leader) right before the doc is indexed (i.e. just before RunUpdateProcessor).

Unfortunately, due to an oversight, this feature hasn't been documented well, which is something I'll fix. See https://issues.apache.org/jira/browse/SOLR-6892 for more details.

Thx, - Bram

-- 
Regards, Shalin Shekhar Mangar.
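A minimal SolrJ sketch of what Shalin describes (the processor name "mycustomproc" is a placeholder for whatever name the factory is registered under in solrconfig.xml):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class PreDistribUpdate {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient solr = new CloudSolrClient("localhost:9983")) {
      solr.setDefaultCollection("collection1");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      UpdateRequest req = new UpdateRequest();
      req.add(doc);
      req.setParam("processor", "mycustomproc");        // runs once, before distribution
      // req.setParam("post-processor", "mycustomproc"); // or on every replica, pre-RunUpdate
      req.process(solr);
      solr.commit();
    }
  }
}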
Re: Block Join Query update documents, how to do it correctly?
On Thu, May 14, 2015 at 12:01 AM, Tom Devel deve...@gmail.com wrote:

I tried to repost the whole modified document (the parent and ALL of its children as one file), and it seems to work on a small toy example, but of course I cannot be sure for a larger instance with thousands of documents, and I would like to know if this is the correct way to go or not.

Absolutely. It is the only way to go so far.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: When is too many fields in qf is too many?
Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document?

Answer: This is a large number of different record types, each with a relatively small number of fields in a particular document. Some documents will have 5 fields, others may have 50 (that's the average).

Could you try to point to a real-world example of where your use case might apply, so we can relate to it?

I'm indexing data off a DB; all the fields of each record are indexed. The application is complex such that it has views, and users belong to 1 or more views. Users can move between views, and views can change over time. A user in view-A can see certain fields, while a user in view-B can see some other fields. So, when a user issues a search, I have to limit which fields that search is executed against. And like I said, because users can move between views, and views can change over time, the list of fields isn't static. This is why I have to pass the list of fields for each search based on the user's current view.

I hope this gives context to the problem I'm trying to solve and describes why I'm using qf and why the list of fields may be long, because there is a case in which a user may belong to N - 1 views.

Steve

On Wed, May 20, 2015 at 11:14 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

The uf parameter is used to specify which fields a user may query against - the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really talking about? Could you try to point to a real-world example of where your use case might apply, so we can relate to it?

Generally, I would say that a Solr document/collection should have no more than low hundreds of fields. It's not that you absolutely can't have more, or absolutely can't have 5,000 or more, but simply that you will be asking for trouble, for example with the cost of comprehending, maintaining, and communicating your solution with others, including this mailing list for support. What specifically pushed you to have documents with 1500 fields? Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document?

-- Jack Krupansky

On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote:

Hi everyone,

My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters.

Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf.

I have considered creating a pseudo-field for each group and then using copyField into that group field. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group change dynamically (at least once a month), and when they do change, I have to re-index everything (this I have to avoid) to sync that group field.

I'm using qf with edismax and my Solr version is 5.1.

Thanks, Steve
Re: Looking up arrays in a sub-entity
I was able to get what I wanted by processing the column in question as massaged text, so that it was a comma-delimited series of IDs, and then passing that to a sub-entity query that went something like: SELECT value FROM othertable WHERE id IN (${master.ids}). It's slow, but I think it's getting the job done. For better performance I would probably script something to feed Solr instead of using the DIH.
[ANN] Relevant Search -- The Book on Search Relevance
Hello fellow Solr users,

We're writing a book on applied Lucene search relevance -- Relevant Search (http://manning.com/turnbull). We want to teach you to improve the quality of your Solr search results! We're trying to bridge the academic side of Information Retrieval from books like Intro. to IR (http://www-nlp.stanford.edu/IR-book/) and Lucene-based search engines like Solr and Elasticsearch.

Manning is offering discount code *39turnbull* to the Solr mailing list readers to get 39% off all formats (http://manning.com/turnbull). You can preview parts/ideas of our book here:

http://java.dzone.com/articles/solr-and-elasticsearch
http://opensourceconnections.com/blog/2015/05/15/relevance-data-modeling/
https://medium.com/@softwaredoug/search-is-eating-the-world-1c3dbdfe9b83

Our chapters seem to be taking the form of 1/3 broad relevance-tuning philosophy, 2/3 useful examples. While we build a lot of our examples with Elasticsearch, we're also working to bridge them to Solr in the final book so it can apply to both audiences. After all, almost every idea is translatable between two search engines that share the same Lucene core. If you get into the book, we'd be open to your ideas (or even help :-P) on how to best do this from the community.

Happy Searching!
-Doug Turnbull and John Berryman

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search http://manning.com/turnbull from Manning Publications
Re: When is too many fields in qf is too many?
Thanks Shawn. I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET; it's about Solr and Lucene having to deal with such a long list of fields. Here is the text of my question reposted:

Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

Steve

On Wed, May 20, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote:

On 5/20/2015 6:27 AM, Steven White wrote:

My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

You have two choices when queries become that large. One is to increase the max HTTP header size in the servlet container. In most containers, webservers, and proxy servers, this defaults to 8192 bytes. This approach works very well, but will not scale to extremely large sizes. I have done this on my indexes, because I regularly have queries in the 20K range, but I do not expect them to get very much larger than this.

The other option is to switch to sending a POST instead of a GET. The default max POST size that Solr sets is 2MB, which is plenty for just about any query, and can be increased easily to much larger sizes. If you are using SolrJ, switching to POST is very easy ... you'd need to research how if you're using another framework.

Thanks, Shawn
Re: Suggestion on field type
Thank you all... You are all experts... I will go with double as this seems to be more feasible.

Regards

On Tue, May 19, 2015 at 7:26 PM, Walter Underwood wun...@wunderwood.org wrote:

A field type based on BigDecimal could be useful, but that would be a fair amount more work. Double is usually sufficient for big data analysis, especially if you are doing simple aggregates (which is most of what Solr can do). If you want to do something fancier, you'll need a database, not a search engine. As I usually do, I'll recommend MarkLogic, which is pretty awesome stuff.

Solr would not be in my top handful of solutions for big data analysis. Personally, I'd stuff it all in JSON in Amazon S3 and run map-reduce against it. If you need to do something like that, you could store a JSON blob in Solr with the exact values, and use approximate fields to narrow things down. Of course, MarkLogic has a graceful interface to Hadoop.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 19, 2015, at 4:09 PM, Erick Erickson erickerick...@gmail.com wrote:

Well, double is all you've got, so that's what you have to work with. _Every_ float is an approximation when you get out to some number of decimal places, so you don't really have any choice. Of course it'll affect the result. The question is whether it affects the result enough to matter, which is application-specific.

Best, Erick

On Tue, May 19, 2015 at 12:05 PM, Vishal Swaroop vishal@gmail.com wrote:

Also 10481.5711458735456*79* indexes to 10481.571145873546 using double:

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

On Tue, May 19, 2015 at 2:57 PM, Vishal Swaroop vishal@gmail.com wrote:

Thanks Erick... I can ignore the trailing zeros. I am indexing data from a Vertica database... Though *double* is very close, SOLR indexes 14 digits after the decimal, e.g. the actual db value has 15 digits after the decimal, i.e. 249.81735425382405*2*, and SOLR indexes 14 digits after the decimal, i.e. 249.81735425382405. As these values will be used for big data analysis, I am wondering if it might impact the result.

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

Any suggestions?

Regards

On Tue, May 19, 2015 at 1:41 PM, Erick Erickson erickerick...@gmail.com wrote:

Why do you want to keep trailing zeros? The original input is preserved in the stored portion and will be returned if you specify the field in your fl list. I'm assuming here that you're looking at the actual indexed terms, and don't really understand why the trailing zeros are important. Do not use strings.

Best, Erick

On Tue, May 19, 2015 at 10:22 AM, Vishal Swaroop vishal@gmail.com wrote:

Thank you John and Jack... Looks like double is much closer... it removes trailing zeros...

a) Is there a way to keep trailing zeros? With double, 194.846189733028000 indexes to 194.846189733028:

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

b) If I use String, will there be an issue doing range queries? With float, 277.677836785372000 indexes to 277.67783:

<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

On Tue, May 19, 2015 at 11:56 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

double (solr.TrieDoubleField) gives more precision. See: https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/schema/TrieDoubleField.html

-- Jack Krupansky

On Tue, May 19, 2015 at 11:27 AM, Vishal Swaroop vishal@gmail.com wrote:

Please suggest which numeric field type to use so that I can get the complete value, e.g. the value in the database is 194.846189733028000. If I index it as float, SOLR indexes it as 194.84619, whereas I need the complete value, i.e. 194.846189733028000. I will also be doing range queries on this field.

<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<field name="value" type="float" indexed="true" stored="true" multiValued="false" />

Regards
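The behaviour in the thread can be reproduced outside Solr, since it comes from IEEE 754 doubles (roughly 15-17 significant decimal digits) rather than from Solr itself; the printed values below match the ones reported above:

public class DoublePrecision {
  public static void main(String[] args) {
    // trailing zeros are not part of the value, so toString() drops them
    System.out.println(Double.parseDouble("194.846189733028000")); // 194.846189733028
    // an 18-digit value rounds to the nearest representable double
    System.out.println(Double.parseDouble("249.817354253824052")); // 249.81735425382405
    // float only carries ~7 significant digits
    System.out.println(Float.parseFloat("277.677836785372000"));   // 277.67783
  }
}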
Re: scoreMode ToParentBlockJoinQuery
Hello, here is the patch: https://issues.apache.org/jira/browse/SOLR-5882

On Tue, May 12, 2015 at 1:11 PM, StrW_dev r.j.bamb...@structweb.nl wrote:

Hi, is it possible to configure the scoreMode of the parent block join query parser (ToParentBlockJoinQuery)? It seems it's set to None, while I would require Max in this case. What I want is to filter on child documents, but still use the relevance/boost of these child documents in the final score.

Gr.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: When is too many fields in qf is too many?
On 5/20/2015 9:24 AM, Steven White wrote:

I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET; it's about Solr and Lucene having to deal with such a long list of fields. Here is the text of my question reposted: Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

You may need to increase maxBooleanClauses beyond the default of 1024. There will be a message in the log if that is required. Note that such an increase must happen on EVERY config you have, or one of them may set it back to the 1024 default -- it's a global JVM-wide config.

Large complex queries are usually slow, requiring more memory and CPU than simple queries, but if you have the resources, Solr will handle it just fine.

Thanks, Shawn
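For SolrJ users, switching the query to POST is a one-argument change; a small sketch (field list and core name are placeholders for the real ~1500 generated names):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class LargeQfSearch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      List<String> groupFields = Arrays.asList("field_one", "field_two"); // ~1500 in practice
      SolrQuery q = new SolrQuery("apple");
      q.set("defType", "edismax");
      q.set("qf", String.join(" ", groupFields)); // a 20K value is fine in a POST body
      solr.query(q, SolrRequest.METHOD.POST);     // avoids the HTTP header size limit
    }
  }
}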
Re: solr 5.x on glassfish/tomcat instead of jetty
Shawn, I agree with you, but some of the decisions in the corporate world are handed down from higher powers/pay grades, who do not always like to hear counter-arguments. For example, this is the same reason why govt/federal restrict tech folks to only using certified DBs/app servers like Oracle, WSAD, etc. (Not to say that govt teams are not using SOLR; I know the Library of Congress etc. use it.) Sometimes the decision is above my pay grade, more so when the firm is not a core technology firm. I would rather find a way than be labeled an anarchist; after all, anything is possible with software, right!!?? ;-) Hope you have already viewed "The Expert" video on YouTube :-)

Thanks, Ravi Kiran Bhaskar

On Wed, May 20, 2015 at 11:21 AM, Shawn Heisey apa...@elyograg.org wrote:

On 5/20/2015 9:07 AM, Ravi Solr wrote:

I have read that solr 5.x has moved away from a deployable WAR architecture to a runnable Java application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1). Any ideas on how I can make it run on Glassfish, or at least on Tomcat? And do I have to watch for any gotchas regarding the different containers or the upgrade itself? Would love to hear from people who have already treaded down that path.

I really need to finish the wiki page on this topic. As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things.

At some point in the future, such deployments will no longer be possible, which is why the docs say you can't do it, even though you can. The project is preparing users for the eventual reality with a documentation change.

I'm wondering ... if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? It is the only container that gets testing ... I assure you that there are no tests in the Solr source code that make sure Glassfish works.

Thanks, Shawn
Re: Edismax
I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote:

Hi all, I've been fine-tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user-generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good.

Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability, in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing.

Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
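For illustration, a sketch contrasting the two parameters (the field names and functions are invented, not from this thread): boost= multiplies the main relevancy score by the function's value, while bq= adds the score of an extra query on top.

import org.apache.solr.client.solrj.SolrQuery;

public class BoostParams {
  public static void main(String[] args) {
    // multiplicative: scales the main score, stable across score ranges
    SolrQuery multiplicative = new SolrQuery("drill bit");
    multiplicative.set("defType", "edismax");
    multiplicative.set("qf", "manufacturer partnumber description");
    multiplicative.set("boost", "if(exists(query($cat)),2,1)");
    multiplicative.set("cat", "category:tools");

    // additive: bolts an extra score on top, sensitive to absolute score sizes
    SolrQuery additive = new SolrQuery("drill bit");
    additive.set("defType", "edismax");
    additive.set("qf", "manufacturer partnumber description");
    additive.set("bq", "category:tools^5");
  }
}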
Re: Edismax
could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

thanks!

-- 
*John Blythe*
Product Manager Lead Developer
251.605.3071 | j...@curvolabs.com
www.curvolabs.com
58 Adams Ave
Evansville, IN 47713

On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote:

I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote:

Hi all, I've been fine-tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user-generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good.

Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability, in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing.

Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Upgrading question
We've been using Solr a bit now for a year or so; 4.6 is the oldest version of Solr we've deployed. We're currently working through the process we'll use to upgrade to 5.1, an upgrade we need for the new facet.stats capabilities.

Reading the Major Changes document, it indicates that there is no longer support for Lucene/Solr 3.x and earlier indexes. It also indicates that you should use the IndexUpgrader included with Solr 4.10 if you're unsure. We've only ever deployed 4.6 and 4.9 Solr installations. Am I safe to assume that we can skip the optimize step and just upgrade to Solr 5.1, perhaps optimizing after we've done that?

Thanks,
Craig Longman
C++ Developer
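If there is any doubt about old segments, the IndexUpgrader tool mentioned in the docs can be run against each index directory before the move; a trivial wrapper (the index path is an argument, and it needs the 4.10 lucene-core jar on the classpath):

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    // same as: java -cp lucene-core-4.10.4.jar org.apache.lucene.index.IndexUpgrader -verbose <dir>
    org.apache.lucene.index.IndexUpgrader.main(new String[] { "-verbose", args[0] });
  }
}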
Re: Error on grouping result set
Possibly you changed the field type at some point without completely blowing away your index and re-indexing from scratch? Based on:

unexpected docvalues type SORTED_SET for field 'vendor' (expected=SORTED)

because you can't group on multi-valued fields, which is I think what's going on here. Either that, or you have some replicas that aren't coming up, based on:

No live SolrServers available to handle this request

Best, Erick

On Wed, May 20, 2015 at 5:55 AM, Abhijit Deka abhijit.d...@rocketmail.com wrote:

Hi, I am having some problems while grouping the result set. I have a Solr schema like this:

<fields>
  <field name="id" type="string" indexed="false" stored="true" required="true" />
  <field name="product" type="string" indexed="true" stored="true" required="true" />
  <field name="vendor" type="string" indexed="true" stored="true" required="true" />
  <field name="language" type="string" indexed="true" stored="true" required="true" />
  <field name="TotalInvoices" type="float" indexed="true" stored="true" required="true"/>
</fields>

I am querying the schema and the result is like this:

product,Vendor,Invoice
abc,vendor1,49206.758
abc,vendor2,35654.981
abc,vendor2,94861.258
abc,vendor3,990.96012
abc,vendor3,990.96012
abc,vendor3,990.9601

I want to group the result by the vendor field, so I post a query like this:

http://localhost:8983/solr/gettingstarted_shard2_replica2/select?q=abc&fl=product%2Cvendor%2CTotalInvoices&wt=json&indent=true&debugQuery=true&group=true&group.field=vendor

I am getting an error for this in the debug field:

error:{ msg:org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2];, trace:org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2]\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:342)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat
Re: Edismax
I believe that boost is a superset of the bq functionality.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote:

could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

thanks!

-- 
*John Blythe*
Product Manager Lead Developer
251.605.3071 | j...@curvolabs.com
www.curvolabs.com
58 Adams Ave
Evansville, IN 47713

On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote:

I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote:

Hi all, I've been fine-tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user-generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good.

Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability, in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing.

Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: ConfigSets and SolrCloud
What is it? There isn't one except zkcli and variants ;). Things are all automatic once you get the configs _to_ Zookeeper, but pushing the config sets up is a manual process. The usual process is to keep the configs in some VCS somewhere so they're safe, do the usual checkout/edit/checkin, and at some point push them to ZK. Then they will be automatically distributed to all the relevant Solr nodes whenever the cores get reloaded, often done with the collections API RELOAD command.

Of course you can cheat in dev environments, at least in IntelliJ, by downloading the Zookeeper plugin that allows you to edit the files directly on Zookeeper, but that's certainly NOT recommended for production, of course.

Best, Erick

On Wed, May 20, 2015 at 10:57 AM, Jim.Musil jim.mu...@target.com wrote:

Hi, I need a little clarification on configSets in solr 5.x. According to this page: https://cwiki.apache.org/confluence/display/solr/Config+Sets I can create named configSets to be shared by other cores. If I create them using this method AND am operating in SolrCloud mode, will it automatically upload these named config sets to zookeeper?

Thanks!
Jim Musil
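(The usual zkcli "variant" for this is upconfig; with a Solr 5.1 layout that looks something like: sh server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir /path/to/configset/conf -confname myconf, where the host and paths are examples to adapt.)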
Re: [solr 5.1] Looking for full text + collation search field
Hi Bjorn,

solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields. Your example looks like diacritics-insensitive search. Please see: ASCIIFoldingFilterFactory

Ahmet

On Wednesday, May 20, 2015 2:53 PM, Björn Keil deeph...@web.de wrote:

Hello, might anyone suggest a field type with which I can do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation?

An example of what I want to do: there is a field composer to which I passed the value "Dvořák, Antonín". I want the following queries to match:

composer:(antonín dvořák)
composer:dvorak
composer:dvorak, antonin

The latter case is possible using a solr.ICUCollationField, but that type does not support an analyzer and consequently no tokenizer; thus it is not helpful. Unlike former versions of Solr, there do not seem to be CollationKeyFilters which you may hang into the analyzer of a solr.TextField ... so I am a bit at a loss as to how I get *both* a tokenizer and a collation at the same time.

Thanks for help, Björn
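A quick Lucene 5.x demonstration of the chain Ahmet is pointing at (a sketch; in schema.xml the equivalent would be a solr.TextField analyzer with StandardTokenizerFactory, LowerCaseFilterFactory and ASCIIFoldingFilterFactory):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FoldingDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")      // StandardTokenizerFactory
        .addTokenFilter("lowercase")    // LowerCaseFilterFactory
        .addTokenFilter("asciifolding") // ASCIIFoldingFilterFactory
        .build();
    TokenStream ts = analyzer.tokenStream("composer", "Dvořák, Antonín");
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // prints: dvorak, then antonin
    }
    ts.end();
    ts.close();
  }
}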
Re: Edismax
On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confidence metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Edismax
Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good. Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing. Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: Problem using a function with a multivalued field
bq: Keep a copy of the value into a non-multi-valued field, using an update processor: This involves indexing a new field Why can't you do this? You can't re-index the data perhaps? It's by far the easiest solution. Best, Erick On Wed, May 20, 2015 at 2:45 AM, Fernando Agüero fjagu...@gmail.com wrote: Hi everyone, I’ve been reading answers around this problem but I wanted to make sure that there is another way out of my problem. The thing is that the solution shouldn’t be on index-time, involve indexing a new field or changing this multi-valued field to a single-valued one. Problem: I need to run a custom function with some fields but I see that it’s not possible to get the value (first value in this case) of a multivalued field. “title” is a multi-valued field. See: if(exists(title),strdist(title,"string1"),0). This throws the “can’t use FieldCache on a multivalued field” error. Solutions that don't work for me: - Keep a copy of the value into a non-multi-valued field, using an update processor: This involves indexing a new field. - Change the field to multiValued=false: This involves using a single-valued field. I will be indexing new data in the future and I need some fields to be multi-valued but I also need to work with them. Thanks in advance, I spent a lot of time with this without a solution. I’m using Solr 4.10.
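For reference, the update-processor route Erick is pointing at needs no changes to the indexing client; a sketch of such a chain in solrconfig.xml (chain and field names are hypothetical) that copies only the first title value into a single-valued field:

    <updateRequestProcessorChain name="first-title">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">title</str>
        <str name="dest">title_first</str>
      </processor>
      <processor class="solr.FirstFieldValueUpdateProcessorFactory">
        <str name="fieldName">title_first</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Functions such as strdist can then run against title_first. It does mean indexing one extra field, which is the trade-off Fernando is hoping to avoid, but it only requires a reindex, not client changes.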
Re: Edismax
cool, will check into it some more via testing -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:22 PM, Walter Underwood wun...@wunderwood.org wrote: I believe that boost is a superset of the bq functionality. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote: could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser thanks! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote: I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote: Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good. Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing. Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: Edismax
I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:27 PM, John Blythe j...@curvolabs.com wrote: cool, will check into it some more via testing -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:22 PM, Walter Underwood wun...@wunderwood.org wrote: I believe that boost is a superset of the bq functionality. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote: could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser thanks! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote: I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote: Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good. Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing. Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: Need help with Nested docs situation
Data scale and request rate should decide between block joins, plain joins, and field collapsing. On Thu, Apr 30, 2015 at 1:07 PM, roySolr royrutten1...@gmail.com wrote: Hello, I have a situation and i'm a little bit stuck on the way how to fix it. For example the following data structure: *Deal* All Coca Cola 20% off *Products* Coca Cola light Coca Cola Zero 1L Coca Cola Zero 20CL Coca Cola 1L When somebody search to Cola discount i want the result of the deal with related products. Solution #1: I could index it with nested docs(solr 4.9). But the problem is when a product has some changes(let's say Zero gets a new name Extra Light) i have to re-index every deal with these products. Solution #2: I could make 2 collections, one with deals and one with products. A Product will get a parentid(dealid). Then i have to do 2 queries to get the information? When i have a resultpage with 10 deals i want to preview the first 2 products. That means a lot of queries but it's doesn't have the update problem from solution #1. Does anyone have a good solution for this? Thanks, any help is appreciated. Roy -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
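If the block-join route is chosen, the query side is compact. A hedged sketch, assuming each deal is indexed as a parent with its products nested as child documents, and a hypothetical doc_type field marks the parents:

    q={!parent which="doc_type:deal"}name:(coca cola)

The which clause must match all parent documents in the index. As Roy notes, the cost of this design is that any change to a child means reindexing the whole parent/child block.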
Re: When is too many fields in qf is too many?
Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominating the summation. But something easy to try if you want to keep playing with dismax. -Doug On Wed, May 20, 2015 at 2:56 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blashphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. 
Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
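A sketch of the copyField idea together with Doug's tie suggestion (field and type names are hypothetical; text_general is assumed to exist in the schema):

    <field name="allText" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="manufacturer" dest="allText"/>
    <copyField source="partnumber" dest="allText"/>
    <copyField source="description" dest="allText"/>

Then either query the single field (qf=allText), or keep the per-field qf and add tie=1.0 so edismax sums the per-field scores instead of taking only the best match.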
Re: Help to index nested document
I'm absolutely sure that you need to group them externally in the indexer eg like a child VALUES entity in DataImportHandler. On Mon, May 11, 2015 at 9:52 PM, Vishal Swaroop vishal@gmail.com wrote: Need your valuable inputs... I am indexing data from database (one table) which is in this example format : id name value 1 Joe 102724904 2 Joe 100996643 - id is primary/ unique key - there can be same name but different value - If I try name as unique key then SOLR removes duplicate and indexes 1 document - I am getting the result in this format... Is there as way I can index data in a way so that I can value can be child for name... response: { numFound: 2, start: 0, docs: [ { id: 1, name: Joe, value: [ 102724904 ] }, { id: 2, name: Joe, value: [ 100996643 ] }... Expected format : docs: [ { name: Joe, value: [ 102724904, 100996643 ] } -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
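A minimal SolrJ sketch of the external grouping Mikhail means (URL and the choice of name as the key are hypothetical): aggregate the rows per name before indexing, then emit one document whose multivalued field carries all the values:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "Joe");         // name becomes the unique key: one doc per name
    doc.addField("name", "Joe");
    doc.addField("value", 102724904);  // repeated addField calls fill a multivalued field
    doc.addField("value", 100996643);
    server.add(doc);
    server.commit();

The value field must be declared multiValued="true" in the schema for this to index.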
Re: Solr Cloud: No live SolrServers available
Seems like the attachements get stripped off. Anyways, here is the 4.7 log on startup INFO - 2015-05-20 10:35:45.786; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312 INFO - 2015-05-20 10:35:45.804; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor C:\apps\solr\solr-4.7.2\example\contexts at interval 0 INFO - 2015-05-20 10:35:45.811; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: C:\apps\solr\solr-4.7.2\example\contexts\solr-jetty-context.xml INFO - 2015-05-20 10:35:47.405; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet INFO - 2015-05-20 10:35:47.460; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() INFO - 2015-05-20 10:35:47.472; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx) INFO - 2015-05-20 10:35:47.473; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI) INFO - 2015-05-20 10:35:47.473; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/' INFO - 2015-05-20 10:35:47.579; org.apache.solr.core.ConfigSolr; Loading container configuration from C:\apps\solr\solr-4.7.2\example\solr\solr.xml INFO - 2015-05-20 10:35:47.674; org.apache.solr.core.CorePropertiesLocator; Config-defined core root directory: C:\apps\solr\solr-4.7.2\example\solr INFO - 2015-05-20 10:35:47.680; org.apache.solr.core.CoreContainer; New CoreContainer 1930610653 INFO - 2015-05-20 10:35:47.680; org.apache.solr.core.CoreContainer; Loading cores into CoreContainer [instanceDir=solr/] INFO - 2015-05-20 10:35:47.691; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting socketTimeout to: 0 INFO - 2015-05-20 10:35:47.691; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting urlScheme to: null INFO - 2015-05-20 10:35:47.695; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting connTimeout to: 0 INFO - 2015-05-20 10:35:47.695; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxConnectionsPerHost to: 20 INFO - 2015-05-20 10:35:47.696; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting corePoolSize to: 0 INFO - 2015-05-20 10:35:47.696; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maximumPoolSize to: 2147483647 INFO - 2015-05-20 10:35:47.697; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxThreadIdleTime to: 5 INFO - 2015-05-20 10:35:47.697; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting sizeOfQueue to: -1 INFO - 2015-05-20 10:35:47.697; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting fairnessPolicy to: false INFO - 2015-05-20 10:35:47.931; org.apache.solr.logging.LogWatcher; SLF4J impl is org.slf4j.impl.Log4jLoggerFactory INFO - 2015-05-20 10:35:47.932; org.apache.solr.logging.LogWatcher; Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)] INFO - 2015-05-20 10:35:47.933; org.apache.solr.core.CoreContainer; Host Name: INFO - 2015-05-20 10:35:47.946; org.apache.solr.cloud.SolrZkServer; STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9987 INFO - 2015-05-20 10:35:48.447; org.apache.solr.core.ZkContainer; Zookeeper client=localhost:9987 INFO - 2015-05-20 10:35:48.489; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper INFO - 2015-05-20 10:35:48.500; org.apache.solr.common.cloud.ConnectionManager; Watcher 
org.apache.solr.common.cloud.ConnectionManager@2bc25a1d name:ZooKeeperConnection Watcher:localhost:9987 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None INFO - 2015-05-20 10:35:48.500; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper INFO - 2015-05-20 10:35:48.516; org.apache.solr.common.cloud.ZkStateReader; Updating cluster state from ZooKeeper... INFO - 2015-05-20 10:35:49.529; org.apache.solr.cloud.ZkController; Register node as live in ZooKeeper:/live_nodes/10.1.172.231:8987_solr INFO - 2015-05-20 10:35:49.536; org.apache.solr.cloud.ZkController; Found a previous node that still exists while trying to register a new live node /live_nodes/10.1.172.231:8987_solr - removing existing node to create another. INFO - 2015-05-20 10:35:49.537; org.apache.solr.common.cloud.SolrZkClient; makePath: /live_nodes/10.1.172.231:8987_solr INFO - 2015-05-20 10:35:49.537; org.apache.solr.common.cloud.ZkStateReader$3; Updating live nodes... (0) INFO - 2015-05-20 10:35:49.544; org.apache.solr.common.cloud.ZkStateReader$3; Updating live nodes... (1) INFO - 2015-05-20 10:35:49.581; org.apache.solr.common.cloud.SolrZkClient; makePath: /configs/myapp47/schema.xml INFO - 2015-05-20 10:35:49.596; org.apache.solr.common.cloud.SolrZkClient; makePath: /configs/myapp47/solrconfig.xml INFO - 2015-05-20 10:35:49.636; org.apache.solr.core.CorePropertiesLocator; Looking for core
ConfigSets and SolrCloud
Hi, I need a little clarification on configSets in solr 5.x. According to this page: https://cwiki.apache.org/confluence/display/solr/Config+Sets I can create named configSets to be shared by other cores. If I create them using this method AND am operating in SolrCloud mode, will it automatically upload these named config sets to zookeeper? Thanks! Jim Musil
Re: When is too many fields in qf is too many?
Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blashphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
Re: When is too many fields in qf is too many?
Thanks for calling out maxBooleanClauses. The current default of 1024 has not caused me any issues (so far) in my testing. However, as you probably saw in Doug Turnbull's reply, it looks like my relevance will suffer. Steve On Wed, May 20, 2015 at 11:42 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 9:24 AM, Steven White wrote: I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET, it's about Solr and Lucene having to deal with such long list of fields. Here is the text of my question reposted: Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. You may need to increase maxBooleanClauses beyond the default of 1024. There will be a message in the log if that is required. Note that such an increase must happen on EVERY config you have, or one of them may set it back to the 1024 default -- it's a global JVM-wide config. Large complex queries are usually slow, requiring more memory and CPU than simple queries, but if you have the resources, Solr will handle it just fine. Thanks, Shawn
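For reference, the knob Shawn is talking about is a single element in each core's solrconfig.xml, inside the <query> section (the value here is arbitrary):

    <maxBooleanClauses>4096</maxBooleanClauses>

As he notes, it behaves as a JVM-wide setting: the last core to load effectively wins, so every config on the node must carry the same value.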
Re: solr 5.x on glassfish/tomcat instead of jetty
Shawn Heisey apa...@elyograg.org wrote: I'm wondering ... if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? Replace Jetty vs. Glassfish with Linux vs. Windows, Eclipse vs. Idea, emacs vs. vi, Java vs. C#... There are many reasons for a corporation to prefer one product over another. One common one is the wish to support as few different platforms as possible: Better the devil you know. We're still on Solr 4.x and deploy it in a tomcat, as that is what Operations prefer to use. From their perspective, Solr is just another thing to run among all the other WARs we throw at them. We will switch away from tomcat when upgrading to Solr 5, but our upgrade has been delayed so far (partly) because of that change. This is a recurring discussion. A list of the merits drawbacks of going WAR-less (or more to the point: Require Solr to be run as an application instead of in a generic container) might be an idea? - Toke Eskildsen
Re: Edismax
John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
Good call thank you On Wed, May 20, 2015 at 5:15 PM, Erick Erickson erickerick...@gmail.com wrote: John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
On 5/20/2015 3:35 PM, John Blythe wrote: regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! As Erick said, it may have been classified as spam and discarded. His advice of using plain text instead of HTML or rich text is one of the top things to try. If you actually received a bounce message, that bounce should have information about why it was rejected. The q.alt parameter is an alternate query, in *lucene* parser syntax, that dismax or edismax will execute when the q parameter is empty or missing. It is quite common to use q.alt=*:* in the handler defaults so that if you omit the q parameter or send an empty string, you get all docs. If there is a non-empty q parameter, q.alt is ignored. Thanks, Shawn
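The handler-defaults pattern Shawn describes, as it would sit in solrconfig.xml (using the usual /select handler as the example):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="q.alt">*:*</str>
      </lst>
    </requestHandler>

With this in place, a request with no q at all returns every document, while any non-empty q is parsed by edismax and q.alt is ignored.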
Re: Reindex of document leaves old fields behind
Well, let's see the code. Standard updates should replace the previous docs; reindexing the same unique ID with fewer fields should show fewer fields. So something's weird here. Just for yucks, though, do issue a query on some of the unique ids in question; I'd be curious whether you get more than one back, which would tell us something. Did you push your schema up to Zookeeper and reload (or restart) your collection before re-indexing things? And are you sure the documents are actually getting indexed and that the update is succeeding? (check your Solr logs probably here). Best, Erick On Wed, May 20, 2015 at 5:12 PM, tuxedomoon dancolem...@yahoo.com wrote: The uniqueKey value is the same. The new documents contain fewer fields than the already indexed ones. Could this cause the updates to be treated as atomic? With the persisting fields treated as un-updated? Routing should be implicit since the collection was created using numShards. Many request for the same document with cache busting produce the same unwanted fields, so I doubt the correct one is hiding somewhere. I can also see the timestamp going up with each reindex.
Reindex of document leaves old fields behind
I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I need. I do not know whether the issue is with my SolrJ client, Solr config or something else.
Re: Reindex of document leaves old fields behind
The uniqueKey value is the same. The new documents contain fewer fields than the already indexed ones. Could this cause the updates to be treated as atomic? With the persisting fields treated as un-updated? Routing should be implicit since the collection was created using numShards. Many requests for the same document with cache busting produce the same unwanted fields, so I doubt the correct one is hiding somewhere. I can also see the timestamp going up with each reindex.
Re: solr 5.x on glassfish/tomcat instead of jetty
On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things. At some point in the future, such deployments will no longer be possible, While we are still on this subject, I am aware there has been an anti-WAR movement in tech, but I don't quite understand where it is coming from. Can someone point me to some website summarizing why WARs are bad? Thanks.
Re: solr 5.x on glassfish/tomcat instead of jetty
Never mind. I found that thread. Sorry for the noise. On 5/20/15, 5:56 PM, TK Solr wrote: On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things. At some point in the future, such deployments will no longer be possible, While we are still at this subject, I have been aware there has been an anti-WAR movement in the tech but I don't quite understand where this movement is coming from. Can someone point me to some website summarizing why WARs are bad? Thanks.
SolrCloud Leader Election
My SolrCloud cluster isn't reassigning the collection leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in this state for a few hours, and the logs continue to report "No registered leader was found after waiting for 4000ms". Is there a way to force it to reassign the leader? I'm running SolrCloud 5.0. I have 7 Solr nodes, 3 Zookeeper nodes, and 3739 collections. Thanks, Ryan
Re: SolrCloud delete by query performance
GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? Thanks Shawn! On Wed, May 20, 2015 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 5:41 PM, Ryan Cutter wrote: I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body= deletequerysource:foo/query/delete Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted but the cluster wasn't thrilled (high load, etc). Is this expected behavior? Is there a better way to delete documents that I'm missing? That's the correct way to do the delete. Before you'll see the change, a commit must happen in one way or another. Hopefully you already knew that. I believe that your setup has some performance issues that are making it very slow and knocking out your Solr nodes temporarily. The most common root problems with SolrCloud and indexes going into recovery are: 1) Your heap is enormous but your garbage collection is not tuned. 2) You don't have enough RAM, separate from your Java heap, for adequate index caching. With a billion documents in your collection, you might even be having problems with both. Here's a wiki page that includes some info on both of these problems, plus a few others: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
Re: Upgrading question
Yep. Solr/Lucene strives for one major revision backwards compatibility. So any 5x should be able to read any index produced with 4x, but no index produced with 3x. Best, Erick On Wed, May 20, 2015 at 2:44 PM, Craig Longman clong...@iconect.com wrote: We've been using Solr a bit now for a year or so, 4.6 is the oldest version of Solr we've deployed. We're currently working through the process we'll use to upgrade to 5.1, an upgrade we need for the new facet.stats capabilities. Reading the Major Changes document, it indicates that there is no longer support for Lucene/Solr 3.x and earlier indexes. It also indicates that you should use the IndexUpgrader included with Solr 4.10 if you're unsure. We've only ever deployed 4.6, and 4.9 Solr installations. Am I safe to assume that we can skip the optimize step and just upgrade to Solr 5.1, perhaps optimizing after we've done that? Thanks, Craig Longman C++ Developer This message and any attachments are intended only for the use of the addressee and may contain information that is privileged and confidential. If the reader of the message is not the intended recipient or an authorized representative of the intended recipient, you are hereby notified that any dissemination of this communication is strictly prohibited. If you have received this communication in error, notify the sender immediately by return email and delete the message and any attachments from your system.
Re: Edismax
A few things: Scores aren't confidence metrics, they are relative rankings, in relation to a single resultset, that's all. Secondly for edismax, boost does multiplicative boosting (whatever function you provide, the score is multiplied by that), whereas bf does additive boosting. Upayavira On Wed, May 20, 2015, at 11:15 PM, Erick Erickson wrote: John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Reindex of document leaves old fields behind
On 5/20/2015 4:43 PM, tuxedomoon wrote: I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I need. I do not know whether the issue is with my SolrJ client, Solr config or something else. Do those documents have the same value in the uniqueKey field? It must be an exact match -- a deviation in upper/lower case will be treated as a new document. If they do have identical information in the uniqueKey field, then there are a few possible problems: Are you indexing full documents, or are you doing Atomic Updates? An atomic update is by definition a change, not a replacement ... so unless that change includes deleting fields, they would not be affected. https://wiki.apache.org/solr/Atomic_Updates If your collection has multiple shards, then this paragraph may apply: What is the router set to on the collection? If it is implicit then you may have indexed the new document to a different shard, which means that it is now in your index more than once, and which one gets returned may not be predictable. The same thing may be true if you are using composite routing and including information in the key that sends the document to a different shard from where it was originally indexed. Thanks, Shawn
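The distinction in SolrJ terms, as a sketch (field names hypothetical): a plain document add replaces the stored document wholesale, while giving a field a Map value turns the add into an atomic update that leaves unmentioned fields untouched:

    // full replacement: any field absent here disappears from the indexed doc
    SolrInputDocument full = new SolrInputDocument();
    full.addField("id", "doc1");
    full.addField("title", "new title");
    server.add(full);

    // atomic update: only title changes; all other stored fields survive
    SolrInputDocument partial = new SolrInputDocument();
    partial.addField("id", "doc1");
    partial.addField("title", java.util.Collections.singletonMap("set", "new title"));
    server.add(partial);

If old fields are surviving a supposed full reindex, it is worth checking that the indexing code is not accidentally producing the second form.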
SolrCloud delete by query performance
I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete> Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted but the cluster wasn't thrilled (high load, etc). Is this expected behavior? Is there a better way to delete documents that I'm missing? Thanks, Ryan
Re: SolrCloud delete by query performance
On 5/20/2015 5:41 PM, Ryan Cutter wrote: I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body= deletequerysource:foo/query/delete Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted but the cluster wasn't thrilled (high load, etc). Is this expected behavior? Is there a better way to delete documents that I'm missing? That's the correct way to do the delete. Before you'll see the change, a commit must happen in one way or another. Hopefully you already knew that. I believe that your setup has some performance issues that are making it very slow and knocking out your Solr nodes temporarily. The most common root problems with SolrCloud and indexes going into recovery are: 1) Your heap is enormous but your garbage collection is not tuned. 2) You don't have enough RAM, separate from your Java heap, for adequate index caching. With a billion documents in your collection, you might even be having problems with both. Here's a wiki page that includes some info on both of these problems, plus a few others: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
Re: SolrCloud delete by query performance
On 5/20/2015 5:57 PM, Ryan Cutter wrote: GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? A deleteByQuery must first query the entire index to determine which IDs to delete. That's going to hit every segment. In the case of SolrCloud, it will also hit at least one replica of every single shard in the collection. If the data required to satisfy the query is not already sitting in the OS disk cache, then the actual disk must be read. When RAM is extremely tight, any disk operation will erase relevant data out of the OS disk cache, so the next time it is needed, it must be read off the disk again. Disks are SLOW. What I am describing is not swap, but the performance impact is similar to swapping. The actual delete operation (once the IDs are known) doesn't touch any segments ... it writes Lucene document identifiers to a .del file, and that file is consulted on all queries. Any deleted documents found in the query results are removed. Thanks, Shawn
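One way to soften the impact when the matching IDs are enumerable, sketched in SolrJ (whether it is actually gentler depends on the same cache pressure Shawn describes; exception handling omitted): run the query once yourself, then delete by id, which avoids re-running the query against every shard:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    SolrQuery query = new SolrQuery("source:foo");
    query.setFields("id");
    query.setRows(1000);  // comfortably above the ~500 expected matches
    QueryResponse rsp = server.query(query);
    List<String> ids = new ArrayList<String>();
    for (SolrDocument d : rsp.getResults()) {
        ids.add((String) d.getFieldValue("id"));
    }
    server.deleteById(ids);
    server.commit();

Delete-by-id requests route straight to the owning shard instead of fanning out to all of them.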
Re: SolrCloud delete by query performance
Shawn, thank you very much for that explanation. It helps a lot. Cheers, Ryan On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 5:57 PM, Ryan Cutter wrote: GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? A deleteByQuery must first query the entire index to determine which IDs to delete. That's going to hit every segment. In the case of SolrCloud, it will also hit at least one replica of every single shard in the collection. If the data required to satisfy the query is not already sitting in the OS disk cache, then the actual disk must be read. When RAM is extremely tight, any disk operation will erase relevant data out of the OS disk cache, so the next time it is needed, it must be read off the disk again. Disks are SLOW. What I am describing is not swap, but the performance impact is similar to swapping. The actual delete operation (once the IDs are known) doesn't touch any segments ... it writes Lucene document identifiers to a .del file, and that file is consulted on all queries. Any deleted documents found in the query results are removed. Thanks, Shawn
solr 5.x on glassfish/tomcat instead of jetty
I have read that solr 5.x has moved away from deployable WAR architecture to a runnable Java Application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1), any ideas on how I can make it run on Glassfish or at least on Tomcat ?? And do I have to watch for any gotchas regarding the different containers or the upgrade itself ? Would love to hear from people who have already treaded down that path. Thanks Ravi Kiran Bhaskar
Re: When is too many fields in qf is too many?
The uf parameter is used to specify which fields a user may query against - the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really talking about? Could you try to point to a real-world example of where your use case might apply, so we can relate to it? Generally, I would say that a Solr document/collection should have no more than low hundreds of fields. It's not that you absolutely can't have more or absolutely can't have 5,000 or more, but simply that you will be asking for trouble, for example, with the cost of comprehending and maintaining and communicating your solution with others, including this mailing list for support. What specifically pushed you to have documents with 1500 fields? Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document? -- Jack Krupansky On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
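For the per-group restriction angle, the two edismax parameters combine roughly like this (field names hypothetical):

    defType=edismax
    qf=title description     (fields searched when the user types bare terms)
    uf=title description     (fields the user may name explicitly, e.g. title:apple)

A fielded term naming a field outside uf is treated as plain text and searched against the qf fields rather than as a field reference, so uf is the parameter that actually fences users in.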
Re: When is too many fields in qf is too many?
Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blasphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
Re: solr 5.x on glassfish/tomcat instead of jetty
On 5/20/2015 9:07 AM, Ravi Solr wrote: I have read that Solr 5.x has moved away from a deployable WAR architecture to a runnable Java application architecture. Our infrastructure/standards folks are adamant about not running Solr on Jetty (we are about to upgrade from 4.7.2 to 5.1). Any ideas on how I can make it run on Glassfish, or at least on Tomcat? And do I have to watch for any gotchas regarding the different containers or the upgrade itself? I would love to hear from people who have already treaded down that path.

I really need to finish the wiki page on this topic. As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for the logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things.

At some point in the future, such deployments will no longer be possible, which is why the docs say you can't do it, even though you can. The project is preparing users for the eventual reality with a documentation change.

I'm wondering: if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? Jetty is the only container that gets tested; I assure you that there are no tests in the Solr source code that make sure Glassfish works.

Thanks, Shawn
Re: Solr Cloud: No live SolrServers available
Erick, thanks for your response. The logs don't seem to show any explicit errors (I have the log level at INFO). I am attaching the logs from a 4.7 start and a 5.1 start here. Note that both logs seem to show the shards as Down initially, but for 5.1 the state changes to Active later on. Also, note that all the config files, libraries, jar files, etc. are the same for both Solr instances. Regards

On Tue, May 19, 2015 at 11:57 AM, Erick Erickson erickerick...@gmail.com wrote: What you've done _looks_ correct at a glance. Take a look at the Solr logs. Don't bother trying to index things unless and until your nodes are active; it won't happen. My first guess is that you have some error in your schema or solrconfig.xml files: syntax errors, typos, class names that are mis-typed, jars that are missing, whatever. If that's true, the Solr log (or the screen, if you're just running from the command line) will show big ugly stack traces. If nothing shows up in the logs then I'm puzzled, but what you describe is consistent with what I've seen in terms of having bad configs and trying to create a collection. Best, Erick

On Tue, May 19, 2015 at 4:33 AM, Chetan Vora chetanv...@gmail.com wrote: Hi all. We have a cluster of standalone Solr cores (Solr 4.3) for which we had built some custom plugins. I'm now trying to prototype converting the cluster to a SolrCloud cluster. This is how I am trying to deploy the cores (in 4.7.2):

1. Start Solr with ZooKeeper embedded:

  java -DzkRun -Djetty.port=8985 -jar start.jar

2. Upload a config into ZooKeeper (the same config as the standalone cores):

  zkcli.bat -zkhost localhost:9985 -cmd upconfig -confdir myconfig -confname myconfig

3. Create a new collection (mycollection) of 2 shards using the Collections API:

  http://localhost:8985/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconfig

So at this point I have two shards under my solr directory with the appropriate core.properties. But when I go to http://localhost:8985/solr/#/~cloud, I see that the two shards' status is Down, when they are supposed to be Active by default. And when I try to index documents in them using SolrJ (via the CloudSolrServer API), I get the error "No live SolrServers available to handle this request". I restarted Solr, but the same issue remains.

  private CloudSolrServer cloudSolr;
  cloudSolr = new CloudSolrServer(zkHOST);
  cloudSolr.setZkClientTimeout(zkClientTimeout);
  cloudSolr.setDefaultCollection(collectionName);
  cloudSolr.connect();
  cloudSolr.add(doc);

What am I doing wrong? I did a lot of digging around and saw an old Jira bug saying that SolrCloud shards won't be active until there are some documents in the index. If that is the reason, that's kind of a catch-22, isn't it? So anyway, I also tried adding some test documents manually and committed to see if things improved. Now, on the shard statistics page, it correctly gives me the numDocs count, but when I try to query it says "no servers hosting shard". I next tried passing in shards.tolerant=true as a query parameter and searched, but no cigar: it says 0 documents found. Any help would be appreciated. My main objective is to rebuild the old standalone cores using SolrCloud and test whether our custom request handlers still work as expected. And at this point, I can't index documents inside the 4.7 SolrCloud collection I have created.
I am trying to use a 4.x SolrCloud release because the internal APIs seem to have changed quite a bit in the 5.x releases, and our custom request handlers no longer work as expected. Thanks and Regards
Re: When is too many fields in qf too many?
On 5/20/2015 6:27 AM, Steven White wrote: My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1,500 fields, and each field name averages 15 characters, so the data passed via qf will be over 20K characters. Given the above, besides the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of limit? Will each search now require more CPU cycles? Memory? Etc.

You have two choices when queries become that large. One is to increase the max HTTP header size in the servlet container; in most containers, webservers, and proxy servers, this defaults to 8192 bytes. This approach works very well, but will not scale to extremely large sizes. I have done this on my indexes, because I regularly have queries in the 20K range, but I do not expect them to get very much larger than that.

The other option is to switch to sending a POST instead of a GET. The default max POST size that Solr sets is 2MB, which is plenty for just about any query, and it can easily be increased to much larger sizes. If you are using SolrJ, switching to POST is very easy; you'd need to do some research to figure out how if you're using another framework.

Thanks, Shawn
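For the SolrJ route Shawn mentions, the switch is a one-argument change on the query call. A minimal sketch, assuming SolrJ 5.x and a hypothetical core URL:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; point this at your own core or collection.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("apple");
        q.set("defType", "edismax");
        q.set("qf", "field_one field_two"); // in Steve's case, the ~20K field list
        // POST carries the parameters in the request body, so the servlet
        // container's HTTP header size limit no longer constrains qf.
        QueryResponse rsp = client.query(q, SolrRequest.METHOD.POST);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        client.close();
    }
}

The two-argument query method is defined on the SolrClient base class, so the same change works with the other client implementations, including the cloud client.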