Re: Bad fieldNorm when using morphologic synonyms
Attached a patch to the JIRA issue. Reviews are welcome. On Thu, Dec 19, 2013 at 7:24 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Roman, do you have any results? created SOLR-5561. Robert, if I'm wrong, you are welcome to close that issue. On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh isaac.he...@gmail.com wrote: You can see the norm value in the explain text when setting debugQuery=true. If the same item gets a different norm before/after, that's it. Note that this configuration is in schema.xml (not solrconfig.xml...) On Monday, December 9, 2013, Roman Chyla wrote: Isaac, is there an easy way to recognize this problem? We also index synonym tokens in the same position (like you do, and I'm sure that our positions are set correctly). I could test whether the default similarity factory in solrconfig.xml had any effect (before/after reindexing). --roman On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi Robert and Manuel. DefaultSimilarity indeed sets discountOverlaps to true by default. BUT the *factory*, aka DefaultSimilarityFactory, when called by IndexSchema (the getSimilarity method), explicitly sets this value to the value of its corresponding class member. This class member is initialized to FALSE when the instance is created (like every boolean variable in the world). It should be set when the init method is called. If the parameter is not set in schema.xml, the default is true. Everything seems to be alright, but the issue is that the init method is NOT called if the similarity is not *explicitly* declared in schema.xml. In that case, the discountOverlaps member (of the factory class) remains FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE. This is very easy to reproduce and debug. On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote: no, it's turned on by default in the default similarity.
As I said, all that is necessary is to fix your analyzer to emit the proper position increments. On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml, which is commented out by default! As this param is false by default, the above situation is expected even with correct positioning, as said. In order to fix the field norms, you'd have to reindex with the similarity class which initializes the param to true. Cheers, Manu
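For reference, the workaround discussed in this thread is to declare the similarity factory explicitly in schema.xml so that its init method actually runs. A hedged sketch (element names follow the stock Solr 4.x example schema; the explicit discountOverlaps line is optional, since init defaults it to true):

```xml
<!-- schema.xml: declaring the factory explicitly ensures init() is called,
     so the discountOverlaps parameter is read (defaulting to true). -->
<similarity class="solr.DefaultSimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>
```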
Re: LocalParam for nested query without escaping?
created SOLR-5560 On Tue, Dec 10, 2013 at 8:48 AM, William Bell billnb...@gmail.com wrote: Sounds like a bug. On Mon, Dec 9, 2013 at 1:16 PM, Isaac Hebsh isaac.he...@gmail.com wrote: If so, can someone suggest how a query should be escaped (securely and correctly)? Should I escape the quote mark (and the backslash mark itself) only? On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Obviously, there is the option of an external parameter ({... v=$nestedq}&nestedq=...). This is a good solution, but it is not practical when having a lot of such nested queries. Any ideas? On Friday, December 6, 2013, Isaac Hebsh wrote: We want to set a LocalParam on a nested query. When querying with the v inline parameter, it works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text v='TERM2 TERM3 \"TERM4 TERM5\"'} the parsedquery_toString is +id:TERM1 +(text:term2 text:term3 text:"term4 term5") A query using _query_ also works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\"" (the parsedquery is exactly the same). BUT, when trying to put the nested query in place, it yields a syntax error: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text}(TERM2 TERM3 "TERM4 TERM5") org.apache.solr.search.SyntaxError: Cannot parse '(TERM2' The previous options are less preferred because of the escaping that must be applied to the nested query. Can't I set a LocalParam on a nested query without escaping the query? -- Bill Bell billnb...@gmail.com cell 720-256-8076
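On the escaping question raised in this thread, one plausible approach (an assumption on my part, not an official Solr utility) is to wrap the nested query in single quotes inside the v local param, and escape backslashes first and then quote marks. Sketched in Python:

```python
def escape_local_param_value(query: str) -> str:
    """Hypothetical helper: escape a raw query string so it can be
    embedded inside v='...' of a Solr local-params expression.
    Backslashes must be escaped before quotes, or the quote-escapes
    would themselves get doubled."""
    return query.replace("\\", "\\\\").replace("'", "\\'")

nested = 'TERM2 TERM3 "TERM4 TERM5"'
local_param = "{!lucene df=text v='%s'}" % escape_local_param_value(nested)
print(local_param)
```

Double quotes pass through untouched here; only single quotes and backslashes conflict with the surrounding v='...' delimiters.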
Re: Bad fieldNorm when using morphologic synonyms
Roman, do you have any results? created SOLR-5561. Robert, if I'm wrong, you are welcome to close that issue. On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh isaac.he...@gmail.com wrote: You can see the norm value in the explain text when setting debugQuery=true. If the same item gets a different norm before/after, that's it. Note that this configuration is in schema.xml (not solrconfig.xml...) On Monday, December 9, 2013, Roman Chyla wrote: Isaac, is there an easy way to recognize this problem? We also index synonym tokens in the same position (like you do, and I'm sure that our positions are set correctly). I could test whether the default similarity factory in solrconfig.xml had any effect (before/after reindexing). --roman On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi Robert and Manuel. DefaultSimilarity indeed sets discountOverlaps to true by default. BUT the *factory*, aka DefaultSimilarityFactory, when called by IndexSchema (the getSimilarity method), explicitly sets this value to the value of its corresponding class member. This class member is initialized to FALSE when the instance is created (like every boolean variable in the world). It should be set when the init method is called. If the parameter is not set in schema.xml, the default is true. Everything seems to be alright, but the issue is that the init method is NOT called if the similarity is not *explicitly* declared in schema.xml. In that case, the discountOverlaps member (of the factory class) remains FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE. This is very easy to reproduce and debug. On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote: no, it's turned on by default in the default similarity. As I said, all that is necessary is to fix your analyzer to emit the proper position increments.
On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml, which is commented out by default! As this param is false by default, the above situation is expected even with correct positioning, as said. In order to fix the field norms, you'd have to reindex with the similarity class which initializes the param to true. Cheers, Manu
Re: Bad fieldNorm when using morphologic synonyms
Hi Robert and Manuel. DefaultSimilarity indeed sets discountOverlaps to true by default. BUT the *factory*, aka DefaultSimilarityFactory, when called by IndexSchema (the getSimilarity method), explicitly sets this value to the value of its corresponding class member. This class member is initialized to FALSE when the instance is created (like every boolean variable in the world). It should be set when the init method is called. If the parameter is not set in schema.xml, the default is true. Everything seems to be alright, but the issue is that the init method is NOT called if the similarity is not *explicitly* declared in schema.xml. In that case, the discountOverlaps member (of the factory class) remains FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE. This is very easy to reproduce and debug. On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote: no, it's turned on by default in the default similarity. As I said, all that is necessary is to fix your analyzer to emit the proper position increments. On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml, which is commented out by default! As this param is false by default, the above situation is expected even with correct positioning, as said. In order to fix the field norms, you'd have to reindex with the similarity class which initializes the param to true. Cheers, Manu
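The failure mode described above can be sketched in a few lines. This is a hypothetical Python model of the factory's control flow (not Solr's actual code), showing why skipping init leaves discountOverlaps at false:

```python
class SimilarityFactory:
    """Hypothetical model of DefaultSimilarityFactory's init behavior."""

    def __init__(self):
        # Mirrors Java's default: a boolean field starts out false.
        self.discount_overlaps = False

    def init(self, params):
        # Runs only when the similarity is declared explicitly in schema.xml;
        # the parameter defaults to true when omitted there.
        self.discount_overlaps = params.get("discountOverlaps", True)

    def get_similarity(self):
        # getSimilarity() unconditionally pushes the member into the
        # similarity, clobbering the similarity's own default of true.
        return {"discountOverlaps": self.discount_overlaps}

declared = SimilarityFactory()
declared.init({})                  # declared in schema.xml: init() runs
implicit = SimilarityFactory()     # not declared: init() never runs

print(declared.get_similarity()["discountOverlaps"])  # True
print(implicit.get_similarity()["discountOverlaps"])  # False
```

The fix in SOLR-5561 is essentially to make the member default to true (or to always call init), so the implicit path matches the documented behavior.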
Re: LocalParam for nested query without escaping?
If so, can someone suggest how a query should be escaped (securely and correctly)? Should I escape the quote mark (and the backslash mark itself) only? On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Obviously, there is the option of an external parameter ({... v=$nestedq}&nestedq=...). This is a good solution, but it is not practical when having a lot of such nested queries. Any ideas? On Friday, December 6, 2013, Isaac Hebsh wrote: We want to set a LocalParam on a nested query. When querying with the v inline parameter, it works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text v='TERM2 TERM3 \"TERM4 TERM5\"'} the parsedquery_toString is +id:TERM1 +(text:term2 text:term3 text:"term4 term5") A query using _query_ also works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\"" (the parsedquery is exactly the same). BUT, when trying to put the nested query in place, it yields a syntax error: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text}(TERM2 TERM3 "TERM4 TERM5") org.apache.solr.search.SyntaxError: Cannot parse '(TERM2' The previous options are less preferred because of the escaping that must be applied to the nested query. Can't I set a LocalParam on a nested query without escaping the query?
Re: Global query parameters to facet query
created SOLR-5542. Anyone else want it? On Thu, Dec 5, 2013 at 8:55 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, It seems that a facet query does not use the global query parameters (for example, field aliasing for the edismax parser). We make intensive use of facet queries (in some cases, we have a lot of facet.query parameters for a single q), and using LocalParams for each facet.query is not convenient. Did I miss a normal way to solve this? Has anyone else encountered this requirement?
Re: Bad fieldNorm when using morphologic synonyms
You can see the norm value in the explain text when setting debugQuery=true. If the same item gets a different norm before/after, that's it. Note that this configuration is in schema.xml (not solrconfig.xml...) On Monday, December 9, 2013, Roman Chyla wrote: Isaac, is there an easy way to recognize this problem? We also index synonym tokens in the same position (like you do, and I'm sure that our positions are set correctly). I could test whether the default similarity factory in solrconfig.xml had any effect (before/after reindexing). --roman On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi Robert and Manuel. DefaultSimilarity indeed sets discountOverlaps to true by default. BUT the *factory*, aka DefaultSimilarityFactory, when called by IndexSchema (the getSimilarity method), explicitly sets this value to the value of its corresponding class member. This class member is initialized to FALSE when the instance is created (like every boolean variable in the world). It should be set when the init method is called. If the parameter is not set in schema.xml, the default is true. Everything seems to be alright, but the issue is that the init method is NOT called if the similarity is not *explicitly* declared in schema.xml. In that case, the discountOverlaps member (of the factory class) remains FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE. This is very easy to reproduce and debug. On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote: no, it's turned on by default in the default similarity. As I said, all that is necessary is to fix your analyzer to emit the proper position increments. On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml, which is commented out by default!
As this param is false by default, the above situation is expected even with correct positioning, as said. In order to fix the field norms, you'd have to reindex with the similarity class which initializes the param to true. Cheers, Manu
Re: Bad fieldNorm when using morphologic synonyms
1) Positions look all right (to me). 2) fieldNorm is determined by the size of the termVector, isn't it? The termVector size isn't affected by the positions. On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir rcm...@gmail.com wrote: Your analyzer needs to set positionIncrement correctly: sounds like it's broken. On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, we implemented a morphologic analyzer, which stems words at index time. For some reasons, we index both the original word and the stem (at the same position, of course). The stemming is done for a specific language, so other languages are not stemmed at all. Because of that, two documents with the same number of terms may have different termVector sizes. A document which contains many words that get stemmed will have a double-sized termVector. This behaviour affects the relevance score in a BAD way: the fieldNorm of these documents reduces their score. This is NOT the wanted behaviour in our case. We are looking for a way to mark the stemmed words (at index time, of course) so they won't affect the fieldNorm. Does such a way exist? Do you have another idea?
LocalParam for nested query without escaping?
We want to set a LocalParam on a nested query. When querying with the v inline parameter, it works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text v='TERM2 TERM3 \"TERM4 TERM5\"'} the parsedquery_toString is +id:TERM1 +(text:term2 text:term3 text:"term4 term5") A query using _query_ also works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\"" (the parsedquery is exactly the same). BUT, when trying to put the nested query in place, it yields a syntax error: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text}(TERM2 TERM3 "TERM4 TERM5") org.apache.solr.search.SyntaxError: Cannot parse '(TERM2' The previous options are less preferred because of the escaping that must be applied to the nested query. Can't I set a LocalParam on a nested query without escaping the query?
Re: LocalParam for nested query without escaping?
Obviously, there is the option of an external parameter ({... v=$nestedq}&nestedq=...). This is a good solution, but it is not practical when having a lot of such nested queries. Any ideas? On Friday, December 6, 2013, Isaac Hebsh wrote: We want to set a LocalParam on a nested query. When querying with the v inline parameter, it works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text v='TERM2 TERM3 \"TERM4 TERM5\"'} the parsedquery_toString is +id:TERM1 +(text:term2 text:term3 text:"term4 term5") A query using _query_ also works fine: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\"" (the parsedquery is exactly the same). BUT, when trying to put the nested query in place, it yields a syntax error: http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND {!lucene df=text}(TERM2 TERM3 "TERM4 TERM5") org.apache.solr.search.SyntaxError: Cannot parse '(TERM2' The previous options are less preferred because of the escaping that must be applied to the nested query. Can't I set a LocalParam to a nested query without escaping the query?
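The external-parameter option mentioned above sidesteps most of the escaping, since the nested query travels as its own request parameter and only needs normal URL encoding. A sketch using Python's standard library (host, collection, and field names taken from the thread):

```python
from urllib.parse import urlencode

# The nested query is referenced via $nestedq inside the local params,
# so its content never has to be escaped for the q parameter itself.
params = {
    "defType": "lucene",
    "df": "id",
    "q": "TERM1 AND {!lucene df=text v=$nestedq}",
    "nestedq": 'TERM2 TERM3 "TERM4 TERM5"',
    "debugQuery": "true",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

As the thread notes, this gets unwieldy with many nested queries, since each one needs its own named parameter.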
Bad fieldNorm when using morphologic synonyms
Hi, we implemented a morphologic analyzer, which stems words at index time. For some reasons, we index both the original word and the stem (at the same position, of course). The stemming is done for a specific language, so other languages are not stemmed at all. Because of that, two documents with the same number of terms may have different termVector sizes. A document which contains many words that get stemmed will have a double-sized termVector. This behaviour affects the relevance score in a BAD way: the fieldNorm of these documents reduces their score. This is NOT the wanted behaviour in our case. We are looking for a way to mark the stemmed words (at index time, of course) so they won't affect the fieldNorm. Does such a way exist? Do you have another idea?
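For context on the resolution later in this thread: length normalization counts tokens per field, and when overlaps are discounted, tokens emitted with positionIncrement == 0 (like stems stacked on the original word) do not inflate the field length. A simplified Python sketch of that accounting (a model, not Lucene's actual code):

```python
def field_length(tokens, discount_overlaps=True):
    """Count tokens for norm computation.

    tokens: list of (term, position_increment) pairs; a stem injected at
    the same position as its source word carries position_increment == 0.
    """
    if discount_overlaps:
        return sum(1 for _term, inc in tokens if inc > 0)
    return len(tokens)

# "dogs" stemmed to "dog" at the same position:
analyzed = [("dogs", 1), ("dog", 0), ("bark", 1)]
print(field_length(analyzed))                           # 2: overlaps discounted
print(field_length(analyzed, discount_overlaps=False))  # 3: double-counted
```

With correct position increments and discountOverlaps in effect, the stemmed document's norm matches an unstemmed one of the same length.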
Global query parameters to facet query
Hi, It seems that a facet query does not use the global query parameters (for example, field aliasing for the edismax parser). We make intensive use of facet queries (in some cases, we have a lot of facet.query parameters for a single q), and using LocalParams for each facet.query is not convenient. Did I miss a normal way to solve this? Has anyone else encountered this requirement?
Re: Bad fieldNorm when using morphologic synonyms
The field is our main textual field. In the standard case, length normalization does significant work alongside tf-idf; we don't want to avoid it. Removing duplicates won't help here, because the terms are not duplicates: one term is stemmed, and the other is not. On Fri, Dec 6, 2013 at 9:48 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Isaac, Did you consider omitting norms completely for that field? omitNorms=true Are you using solr.RemoveDuplicatesTokenFilterFactory? On Thursday, December 5, 2013 8:55 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, we implemented a morphologic analyzer, which stems words at index time. For some reasons, we index both the original word and the stem (at the same position, of course). The stemming is done for a specific language, so other languages are not stemmed at all. Because of that, two documents with the same number of terms may have different termVector sizes. A document which contains many words that get stemmed will have a double-sized termVector. This behaviour affects the relevance score in a BAD way: the fieldNorm of these documents reduces their score. This is NOT the wanted behaviour in our case. We are looking for a way to mark the stemmed words (at index time, of course) so they won't affect the fieldNorm. Does such a way exist? Do you have another idea?
Re: Solr Result Tagging
Hi, Try using facet.query on each part; you will get the number of total hits for every OR. If you need this info per document, the answers might appear when specifying debugQuery=true. If that info is useful, try adding [explain] to the fl param (probably requires registering the augmenter plugin in solrconfig) - Isaac. On Friday, October 25, 2013, Cool Techi wrote: Hi, My search queries to Solr are of the following nature: (A OR B OR C) OR (X AND Y AND Z) OR ((ABC AND DEF) - XYZ) What I am trying to achieve is: when I fire the query, the results returned should be tagged with which part of the OR produced the result. In case all three parts above are applicable, the result should indicate the same. I tried the group.query feature, but it doesn't seem to work on SolrCloud. Thanks, Ayush
Re: Profiling Solr Lucene for query
Hi Dmitry, I'm trying to examine your suggestion to create a frontend node. It sounds pretty useful. I saw that every node in a Solr cluster can serve requests for any collection, even if it does not hold a core of that collection. Because of that, I thought that adding a new node to the cluster (aka the frontend/gateway server) and creating a dummy collection (with 1 dummy core) would solve the problem. But I see that a request which is sent to the gateway node is not then sent to the shards. Instead, the request is proxied to a (random) core of the requested collection, and from there it is sent to the shards. (This is reasonable, because the SolrCore on the gateway might run with a different configuration, etc.) This means that my new node isn't functioning as a frontend (which is responsible for sorting, etc.), but as a poor load balancer. No performance improvement will come from this implementation. So, how do you suggest implementing a frontend? On the one hand, it has to run a core of the target collection, but on the other hand, we don't want it to hold any shard contents. On Fri, Sep 13, 2013 at 1:08 PM, Dmitry Kan solrexp...@gmail.com wrote: Manuel, Whether to have the frontend Solr as an aggregator of shard results depends on your requirements. To repeat, we found merging from many shards very inefficient for our use case. It can be the opposite for you (i.e. it requires testing). There are some limitations with distributed search, see here: http://docs.lucidworks.com/display/solr/Distributed+Search+with+Index+Sharding On Wed, Sep 11, 2013 at 3:35 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Dmitry - currently we don't have such a frontend; creating one sounds like a good idea. And yes, we do query all 36 shards on every query. Mikhail - I do think 1 minute is enough data, as during this exact minute I had a single query running (that took a qtime of 1 minute). I wanted to isolate these hard queries. I repeated this profiling a few times.
I think I will take the termIndexInterval from 128 to 32 and check the results. I'm currently using NRTCachingDirectoryFactory. On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi Manuel, The frontend Solr instance is the one that does not have its own index and does the merging of the results. Is this the case? If yes, are all 36 shards always queried? Dmitry On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi Dmitry, I have Solr 4.3, and every query is distributed and merged back for ranking purposes. What do you mean by frontend Solr? On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote: Are you querying your shards via a frontend Solr? We have noticed that querying becomes much faster if results merging can be avoided. Dmitry On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all, Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000. I do return only ids though. I can quite firmly say that this bad performance is due to a slow storage issue (which is beyond my control for now). Despite this, I want to improve my performance.
As taught in school, I started profiling these queries, and the data of a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg Main observation: most of the time I am waiting for readVInt, whose stacktrace (2 out of 2 thread dumps) is:
catalina-exec-3870 - Thread t@6615 java.lang.Thread.State: RUNNABLE
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
at org.apache.lucene.index.TermContext.build(TermContext.java:95)
at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183) at
Re: Profiling Solr Lucene for query
Hi Shawn, I know that every node operates as a frontend. This is the way our cluster currently runs. If I separate the frontend from the nodes which hold the shards, I can give it a different amount of CPUs and RAM (e.g. a large amount of RAM to the JVM, because this server won't need the OS cache for reading the index, or more CPUs, because the merging process might be more CPU intensive). Isn't that possible? On Wed, Oct 2, 2013 at 12:42 AM, Shawn Heisey s...@elyograg.org wrote: On 10/1/2013 2:35 PM, Isaac Hebsh wrote: Hi Dmitry, I'm trying to examine your suggestion to create a frontend node. It sounds pretty useful. I saw that every node in a Solr cluster can serve requests for any collection, even if it does not hold a core of that collection. Because of that, I thought that adding a new node to the cluster (aka the frontend/gateway server) and creating a dummy collection (with 1 dummy core) would solve the problem. But I see that a request which is sent to the gateway node is not then sent to the shards. Instead, the request is proxied to a (random) core of the requested collection, and from there it is sent to the shards. (This is reasonable, because the SolrCore on the gateway might run with a different configuration, etc.) This means that my new node isn't functioning as a frontend (which is responsible for sorting, etc.), but as a poor load balancer. No performance improvement will come from this implementation. So, how do you suggest implementing a frontend? On the one hand, it has to run a core of the target collection, but on the other hand, we don't want it to hold any shard contents. With SolrCloud, every node is a frontend node. If you're running SolrCloud, then it doesn't make sense to try and use that concept. It only makes sense to create a frontend node (or core) if you are using traditional distributed search, where you need to include a shards parameter. http://wiki.apache.org/solr/DistributedSearch Thanks, Shawn
Considerations about setting maxMergedSegmentMB
Hi, Trying to solve a query performance issue, we suspect the number of index segments, which might slow queries (due to I/O seeks, which happen for each term in the query, multiplied by the number of segments). We are on Solr 4.3 (TieredMergePolicy with a mergeFactor of 4). We can reduce the number of segments by enlarging maxMergedSegmentMB from the default 5GB to something bigger (10GB, 15GB?). What are the side effects which should be considered when doing this? Has anyone changed this setting in PROD for a while?
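For reference, these knobs live in solrconfig.xml; a hedged sketch (Solr 4.x style, values illustrative only, not a recommendation). Larger maxMergedSegmentMB means fewer, bigger segments, but merges touching them copy more data, and the tail of a shard may stay under-merged:

```xml
<!-- solrconfig.xml: TieredMergePolicy tuning. mergeFactor=4 maps to
     maxMergeAtOnce / segmentsPerTier; maxMergedSegmentMB caps the size
     of segments produced by normal (non-forced) merges. -->
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">4</int>
  <int name="segmentsPerTier">4</int>
  <double name="maxMergedSegmentMB">10240</double>
</mergePolicy>
```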
Re: Data duplication using Cloud+HDFS+Mirroring
Hi Greg, Did you get an answer? I'm interested in the same question. More generally, what are the benefits of HdfsDirectoryFactory, besides the transparent restore of shard contents in case of a disk failure, and the ability to rebuild the index using MapReduce? Is the following statement accurate? Blocks of a particular shard, which are replicated to another node, will never be queried, since there is no Solr core configured to read them. On Wed, Aug 7, 2013 at 8:46 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: While testing Solr's new ability to store data and transaction directories in HDFS, I added an additional core to one of my testing servers that was configured as a backup (active but not leader) core for a shard elsewhere. It looks like this extra core copies the data into its own directory rather than just using the existing directory with the data that's already available to it. Since HDFS likely already has redundancy of the data covered via the replicationFactor, is there a reason for non-leader cores to create their own data directory rather than doing reads on the existing master copy? I searched Jira for anything that suggests this behavior might change and didn't find any issues; is there any intent to address this? Thanks, Greg
Re: Getting a query parameter in a TokenFilter
Thought about that again: we can do this work as a search component, manipulating the query string. The cons are the double QParser work and the double tokenization work. Another approach which might solve this issue easily is the dynamic query analyze chain: https://issues.apache.org/jira/browse/SOLR-5053 What would you do? On Tue, Sep 17, 2013 at 10:31 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi everyone, We developed a TokenFilter. It should act differently depending on a parameter supplied in the query (for the query chain only, not the index one, of course). We found no way to pass that parameter into the TokenFilter flow. I guess the root cause is that TokenFilter is a pure Lucene object. As a last resort, we tried to pass the parameter as the first term in the query text (q=...), and save it as a member of the TokenFilter instance. Although it is ugly, it might work fine. But the problem is that it is not guaranteed that all the terms of a particular query will be analyzed by the same instance of a TokenFilter. In that case, some terms will be analyzed without the required information from that parameter. We can produce such a race very easily. How should I overcome this issue? Does anyone have a better resolution?
Getting a query parameter in a TokenFilter
Hi everyone, We developed a TokenFilter. It should act differently depending on a parameter supplied in the query (for the query chain only, not the index one, of course). We found no way to pass that parameter into the TokenFilter flow. I guess the root cause is that TokenFilter is a pure Lucene object. As a last resort, we tried to pass the parameter as the first term in the query text (q=...), and save it as a member of the TokenFilter instance. Although it is ugly, it might work fine. But the problem is that it is not guaranteed that all the terms of a particular query will be analyzed by the same instance of a TokenFilter. In that case, some terms will be analyzed without the required information from that parameter. We can produce such a race very easily. How should I overcome this issue? Does anyone have a better resolution?
documentCache and lazyFieldLoading
Hi, We've investigated a memory dump which was taken after some frequent OOM incidents. The main issue we found was many millions of LazyField instances, taking ~2GB of memory, even though queries request only about 10 small fields. We've found that LazyDocument creates a LazyField object for every item in a multivalued field, even if we do not want this field. For example, documents contain a multivalued field, named f, with a lot of values (let's say 100 values per document). Queries set fl=id (requesting only the document id). The documentCache will grow in memory :( In our case, documentCache was configured to 32000. There are 2 cores per node, so 64000 LazyDocument instances are in memory. This is a pretty big number, and we'll reduce it. I'm curious whether this is a known issue or not, and why should LazyDocument know the number of values in a multivalued field which is not requested? Another thought which I had: is it reasonable to add something like {!cache=false} which will affect documentCache? For example, if my query requests id only, with a big rows parameter, I don't want documentCache to hold these big LazyDocument objects. Did anyone else encounter this?
Re: documentCache and lazyFieldLoading
Thanks Hoss. 1. We currently use Solr 4.3.0. 2. I understand this architecture of LazyFields, but I did not understand why multiple LazyFields should be created for a multivalued field. You can't load only a part of them; if you request the field, you will get ALL of its values, so 100 (or more) placeholders are not necessary in this case. Moreover, why should Solr KNOW how many values are in that unloaded field? 3. In our poor case, we might handle some concurrent queries, each one requesting rows=2000. What do you think about temporarily disabling documentCache for a specific query? On Thu, Aug 29, 2013 at 10:11 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : The main issue we found was a lot of millions of LazyField instances, : taking ~2GB of memory, even though queries request about 10 small fields : only. which version of Solr are you using? there was a really bad bug with lazyFieldLoading fixed in Solr 4.2.1 (SOLR-4589) : We've found that LazyDocument creates a LazyField object for every item in : a multivalued field, even if do not want this field. right, that's exactly how lazyFieldLoading is designed to work -- instead of loading the full field values into ram, only a small LazyField object is loaded in its place, and that LazyField only fetches the underlying data if/when it's requested. If the LazyField instances weren't created as placeholders, subsequent requests for the document that *might* request additional fields (beyond the 10 small fields that were requested the first time) would have no way of knowing if/when those additional fields existed to be able to fetch them from the index. : In our case, documentCache was configured to 32000. There are 2 cores per : node, so 64000 LazyDocument instances are in memory. This is pretty big : number, and we'll reduce it. FWIW: Even at 1/10 that size, that seems like a ridiculously large documentCache to me. -Hoss
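For reference, the documentCache size under discussion is set in solrconfig.xml; a hedged sketch with illustrative values (not a recommendation), much smaller than the 32000 mentioned above:

```xml
<!-- solrconfig.xml: a documentCache sized in the hundreds or low
     thousands. Note that with lazyFieldLoading=true, each cached
     document still holds LazyField placeholders for every stored
     value, including every entry of a multivalued field. -->
<documentCache class="solr.LRUCache"
               size="2048"
               initialSize="512"
               autowarmCount="0"/>
```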
Re: Sending shard requests to all replicas
Thanks to Ryan Ernst, my issue is a duplicate of SOLR-4449. I think that this proposal might be very useful (some supporting links are attached there; worth reading). On Tue, Jul 30, 2013 at 11:49 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I submitted a new JIRA for this: https://issues.apache.org/jira/browse/SOLR-5092 A (very initial) patch is already attached. Reviews are very welcome. On Sun, Jul 28, 2013 at 4:50 PM, Erick Erickson erickerick...@gmail.com wrote: You'd probably start in CloudSolrServer in SolrJ code, as far as I know that's where the request is sent out. I'd think that would be better than changing Solr itself since if you found that this was useful you wouldn't be patching your Solr release, just keeping your client up to date. Best Erick On Sat, Jul 27, 2013 at 7:28 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Shawn, thank you for the tips. I know the significant cons of virtualization, but I don't want to move this thread into a virtualization pros/cons discussion in the Solr(Cloud) case. I've just asked what minimal code change should be made, in order to examine whether this is a possible solution or not.. :) On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 3:33 PM, Isaac Hebsh wrote: I have about 40 shards. repFactor=2. The cause of slower shards is very interesting, and this is the main approach we took. Note that in every query, a different shard is the slowest. In 20% of the queries, the slowest shard takes about 4 times the average shard qtime. While we continue investigating, remember it might be the virtualization / storage-access / network / gc /..., so I thought that reducing the effect of the slow shards might be a good (temporary or permanent) solution. Virtualization is not the best approach for Solr.
Assuming you're dealing with your own hardware and not something based in the cloud like Amazon, you can get better results by running on bare metal and having multiple shards per host. Garbage collection is a very likely source of this problem. http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems I thought it should be an almost trivial code change (for proving the concept). Isn't it? I have no idea what you're saying/asking here. Can you clarify? It seems to me that sending requests to all replicas would just increase the overall load on the cluster, with no real benefit. Thanks, Shawn
Re: Sending shard requests to all replicas
Hi Erick, thanks. I have about 40 shards. repFactor=2. The cause of slower shards is very interesting, and this is the main approach we took. Note that in every query, a different shard is the slowest. In 20% of the queries, the slowest shard takes about 4 times the average shard qtime. While we continue investigating, remember it might be the virtualization / storage-access / network / gc /..., so I thought that reducing the effect of the slow shards might be a good (temporary or permanent) solution. I thought it should be an almost trivial code change (to prove the concept). Isn't it? On Sat, Jul 27, 2013 at 6:11 PM, Erick Erickson erickerick...@gmail.com wrote: This has been suggested, but so far it's not been implemented as far as I know. I'm curious though, how many shards are you dealing with? I wonder if it would be a better idea to try to figure out _why_ you so often have a slow shard and whether the problem could be cured with, say, better warming queries on the shards... Best Erick On Fri, Jul 26, 2013 at 8:23 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi! When SolrCloud executes a query, it creates shard requests, which are sent to one replica of each shard. Total QTime is determined by the slowest shard response (plus some extra time). [For simplicity, let's assume that no stored fields are requested.] I suffer from a situation where in every query, some shards are much slower than others. We might consider a different approach, which sends the shard request to *ALL* replicas of each shard. Solr would continue once responses have arrived from at least one replica of each shard. Of course, the amount of wasted work is big (multiplied by replicationFactor), but in my case, there are very few concurrent queries, and the most important performance metric is qtime. Such a solution might improve qtime significantly. Has anyone tried this before? Any tip on where I should start in the code?
Re: Sending shard requests to all replicas
Shawn, thank you for the tips. I know the significant cons of virtualization, but I don't want to move this thread into a virtualization pros/cons discussion in the Solr(Cloud) case. I've just asked what minimal code change should be made, in order to examine whether this is a possible solution or not.. :) On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 3:33 PM, Isaac Hebsh wrote: I have about 40 shards. repFactor=2. The cause of slower shards is very interesting, and this is the main approach we took. Note that in every query, a different shard is the slowest. In 20% of the queries, the slowest shard takes about 4 times the average shard qtime. While we continue investigating, remember it might be the virtualization / storage-access / network / gc /..., so I thought that reducing the effect of the slow shards might be a good (temporary or permanent) solution. Virtualization is not the best approach for Solr. Assuming you're dealing with your own hardware and not something based in the cloud like Amazon, you can get better results by running on bare metal and having multiple shards per host. Garbage collection is a very likely source of this problem. http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems I thought it should be an almost trivial code change (for proving the concept). Isn't it? I have no idea what you're saying/asking here. Can you clarify? It seems to me that sending requests to all replicas would just increase the overall load on the cluster, with no real benefit. Thanks, Shawn
Sending shard requests to all replicas
Hi! When SolrCloud executes a query, it creates shard requests, which are sent to one replica of each shard. Total QTime is determined by the slowest shard response (plus some extra time). [For simplicity, let's assume that no stored fields are requested.] I suffer from a situation where in every query, some shards are much slower than others. We might consider a different approach, which sends the shard request to *ALL* replicas of each shard. Solr would continue once responses have arrived from at least one replica of each shard. Of course, the amount of wasted work is big (multiplied by replicationFactor), but in my case, there are very few concurrent queries, and the most important performance metric is qtime. Such a solution might improve qtime significantly. Has anyone tried this before? Any tip on where I should start in the code?
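The "ask every replica, keep the first response" idea proposed here can be sketched outside Solr. This toy Java example (class name and simulated latencies are made up; it is not Solr code) uses ExecutorService.invokeAny, which blocks until one task completes successfully and cancels the rest:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch of first-replica-wins querying. Each replica call is simulated with
 * a sleep; in a real client, each task would issue the shard HTTP request.
 */
public class FirstReplicaWins {
    // Stand-in for network transfer plus search time on one replica.
    static String queryReplica(String replica, long simulatedLatencyMs) throws Exception {
        Thread.sleep(simulatedLatencyMs);
        return "response from " + replica;
    }

    /** Sends the "shard request" to all replicas; returns the first successful response. */
    public static String queryShard(List<String> replicas, List<Long> latenciesMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < replicas.size(); i++) {
                final String replica = replicas.get(i);
                final long latency = latenciesMs.get(i);
                tasks.add(() -> queryReplica(replica, latency));
            }
            // invokeAny returns the first successfully completed result
            // and cancels the still-running tasks.
            return pool.invokeAny(tasks);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // replica2 is "slow"; overall time tracks the fastest replica instead.
        System.out.println(queryShard(List.of("replica1", "replica2"), List.of(10L, 500L)));
    }
}
```

As the thread notes, total work is multiplied by replicationFactor, so this trade only makes sense with few concurrent queries.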
MoinMoin Dump
Hi, There was a thread about viewing the Solr Wiki offline, about 6 months ago. I'm interested, too. It seems that a manual (cron?) dump would do the work... Would it be too much to ask one of the admins to manually create such a dump? (http://moinmo.in/HelpOnMoinCommand/ExportDump) Otis, is there any progress on this with Apache Infra?
Re: Wildcards and Phrase queries
Ahmet, it looks great! Can you tell us why this code hasn't been committed into the lucene+solr trunk? On Sun, Jun 23, 2013 at 2:28 PM, Ahmet Arslan iori...@yahoo.com wrote: Hi Isaac, ComplexPhrase-4.2.1.zip should work with solr4.2.1. The zipball contains a ReadMe.txt file with instructions. You could try higher solr versions too. If it does not work, please let us know. https://issues.apache.org/jira/secure/attachment/12579832/ComplexPhrase-4.2.1.zip From: Isaac Hebsh isaac.he...@gmail.com To: solr-user@lucene.apache.org Sent: Saturday, June 22, 2013 9:33 PM Subject: Re: Wildcards and Phrase queries Thanks Erick. Maybe lucene (java-user) is a better mailing list to ask in? On Sat, Jun 22, 2013 at 7:30 AM, Erick Erickson erickerick...@gmail.com wrote: Wouldn't imagine they're production ready, they haven't been touched in months. So I'd say you're on your own here in terms of whether you wanted to use these for production. I confess I don't know what state they were left in or why they were never committed. FWIW, Erick On Wed, Jun 19, 2013 at 10:08 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I'm trying to understand the status of enabling wildcards in phrase queries. Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486 Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604 It looks like these issues are not going to be solved in the near future :( Will they? Have they come to a (partial) dead end in the current approach? Can I contribute anything to get them fixed in an official version? Are the latest patches attached to the JIRAs production ready? [Should this message be sent to the java-user list?]
Re: Wildcards and Phrase queries
Thanks Erick. Maybe lucene (java-user) is a better mailing list to ask in? On Sat, Jun 22, 2013 at 7:30 AM, Erick Erickson erickerick...@gmail.com wrote: Wouldn't imagine they're production ready, they haven't been touched in months. So I'd say you're on your own here in terms of whether you wanted to use these for production. I confess I don't know what state they were left in or why they were never committed. FWIW, Erick On Wed, Jun 19, 2013 at 10:08 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I'm trying to understand the status of enabling wildcards in phrase queries. Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486 Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604 It looks like these issues are not going to be solved in the near future :( Will they? Have they come to a (partial) dead end in the current approach? Can I contribute anything to get them fixed in an official version? Are the latest patches attached to the JIRAs production ready? [Should this message be sent to the java-user list?]
Wildcards and Phrase queries
Hi, I'm trying to understand the status of enabling wildcards in phrase queries. Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486 Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604 It looks like these issues are not going to be solved in the near future :( Will they? Have they come to a (partial) dead end in the current approach? Can I contribute anything to get them fixed in an official version? Are the latest patches attached to the JIRAs production ready? [Should this message be sent to the java-user list?]
OutOfMemory while indexing (PROD environment!)
Hi everyone, My SolrCloud cluster (4.3.0) went into production a few days ago. Docs are being indexed into Solr using the /update requestHandler, as a POST request containing a text/xml content-type. The collection is sharded into 36 pieces, and each shard has two replicas. There are 36 nodes (each node on a separate virtual machine), so each node holds exactly 2 cores. Each update request contains 100 docs, which means 2-3 docs for each shard. There are 1-2 such requests every minute. Soft-commit happens every 10 minutes, hard-commit every 30 minutes, and ramBufferSizeMB=128. After 48 hours of zero problems, suddenly one shard went down (both of its cores). The log says it's OOM (GC overhead limit exceeded). The JVM is set to Xmx=4G. I'm pretty sure that some minutes before this incident, JVM memory wasn't so high (even the max memory usage indicator was below 2G). Indexing requests did not stop, and started getting HTTP 503 errors (no server hosting shard). At this time, some other cores started to go down (I had all of the rainbow colors: Active, Recovering, Down, Recovery Failed and Gone :). Then I tried to restart the tomcat of the down nodes, but some of them failed to start, due to the error message: we are not the leader. Only shutting down both cores and starting them gradually solved the problem, and the whole cluster came back to a green state. Solr is not yet exposed to users, so no queries were made at that time (but maybe some non-heavy auto-warm queries were executed). I don't think that all of the 4GB were being used for justifiable reasons.. I guess that adding more RAM will not solve the problem in the long term. Where should I start my log investigation? (about the OOM itself, and about the chain accident that came after it) I did a search for previous similar issues. There are a lot, but most of them talk about very old versions of Solr. [Versions: Solr: 4.3.0 Tomcat 7 JVM: Oracle 7 (latest, standard, JRE), 64bit. OS: RedHat 6.3]
Re: Prevention of heavy wildcard queries
Hi everyone. I came across another need for term extraction: I want to find pairs of words that appear together in queries. All of the clustering work is ready, and the only missing piece is how to get the basic terms from the query. Has nobody tried it before? Is there no clean way to do it? On Tue, May 28, 2013 at 7:08 AM, Isaac Hebsh isaac.he...@gmail.com wrote: I don't want to affect the (correctness of the) real query parsing, so creating a QParserPlugin is risky. Instead, if I parse the query in my search component, it will be detached from the real query parsing (obviously this causes double parsing, but assume it's OK)... On Tue, May 28, 2013 at 3:52 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Isaac, it is as you say, with the exception that you create a QParserPlugin, not a search component * create QParserPlugin, give it some name, eg. 'nw' * make a copy of the pipeline - your component should be at the same place, or just above, the wildcard processor also make sure you are setting your qparser for FQ queries, ie. fq={!nw}foo On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Thanks Roman. Based on some of your suggestions, will the steps below do the work? * Create (and register) a new SearchComponent * In its prepare method: Do for Q and all of the FQs (so this SearchComponent should run AFTER QueryComponent, in order to see all of the FQs) * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser, with a special implementation of QueryNodeProcessorPipeline, which contains my NodeProcessor at the top of its list. * Set my analyzer into that StandardQueryParser * My NodeProcessor will be called for each term in the query, so it can throw an exception if a (basic) querynode contains a wildcard at both the start and end of the term. Do I have a way to avoid reimplementing the whole StandardQueryParser class? you can try subclassing it, if it allows it Will this work for both LuceneQParser and EdismaxQParser queries?
this will not work for edismax, nothing but changing the edismax qparser will do the trick Any other solution/work-around? How do other production environments of Solr overcome this issue? you can also try modifying the standard solr parser, or even the JavaCC generated classes I believe many people do just that (or some sort of preprocessing) roman On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote: You are right that starting to parse the query before the query component can get soon very ugly and complicated. You should take advantage of the flex parser, it is already in lucene contrib - but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 The way you can solve this is: 1. use the standard syntax grammar (which allows *foo*) 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or raise error etc this way, you are changing semantics - but don't need to touch the syntax definition; of course, you may also change the grammar and allow only one instance of wildcard (or some combination) but for that you should probably use LUCENE-5014 roman On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. Searching terms with wildcard in their start, is solved with ReversedWildcardFilterFactory. But, what about terms with wildcard in both start AND end? This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and to drop the query if too heavy wildcard are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. 
These two options require an analysis of the query text, which might be an ugly work (just think about nested queries [using _query_], OR even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. if no simple solution exists, timeAllowed limit is the best work-around I could think about. Any other suggestions?
Prevention of heavy wildcard queries
Hi. Searching for terms with a wildcard at their start is solved with ReversedWildcardFilterFactory. But what about terms with a wildcard at both start AND end? This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and drop the query if too-heavy wildcards are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. These two options require an analysis of the query text, which might be ugly work (just think about nested queries [using _query_], or even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. if no simple solution exists, the timeAllowed limit is the best work-around I could think of. Any other suggestions?
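The double-ended wildcard check itself is simple to sketch; the hard part, as the thread points out, is parsing real query syntax. A naive illustration (class and method names are hypothetical; it only handles whitespace-separated bare terms, not quotes, field prefixes, or nested _query_ clauses):

```java
import java.util.Arrays;

/** Toy detector for terms with a wildcard at both start AND end. */
public class WildcardGuard {
    // true if a bare term both starts and ends with * or ?
    static boolean isDoubleEnded(String term) {
        if (term.length() < 2) return false;
        char first = term.charAt(0);
        char last = term.charAt(term.length() - 1);
        return (first == '*' || first == '?') && (last == '*' || last == '?');
    }

    /** Naive whitespace scan; a real implementation needs a real query parser. */
    public static boolean containsHeavyWildcard(String q) {
        return Arrays.stream(q.split("\\s+")).anyMatch(WildcardGuard::isDoubleEnded);
    }
}
```

A search component's prepare method could call such a check and throw an exception to reject the request, as proposed above.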
Re: Prevention of heavy wildcard queries
Thanks Roman. Based on some of your suggestions, will the steps below do the work? * Create (and register) a new SearchComponent * In its prepare method: Do for Q and all of the FQs (so this SearchComponent should run AFTER QueryComponent, in order to see all of the FQs) * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser, with a special implementation of QueryNodeProcessorPipeline, which contains my NodeProcessor in the top of its list. * Set my analyzer into that StandardQueryParser * My NodeProcessor will be called for each term in the query, so it can throw an exception if a (basic) querynode contains wildcard in both start and end of the term. Do I have a way to avoid from reimplementing the whole StandardQueryParser class? Will this work for both LuceneQParser and EdismaxQParser queries? Any other solution/work-around? How do other production environments of Solr overcome this issue? On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote: You are right that starting to parse the query before the query component can get soon very ugly and complicated. You should take advantage of the flex parser, it is already in lucene contrib - but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 The way you can solve this is: 1. use the standard syntax grammar (which allows *foo*) 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or raise error etc this way, you are changing semantics - but don't need to touch the syntax definition; of course, you may also change the grammar and allow only one instance of wildcard (or some combination) but for that you should probably use LUCENE-5014 roman On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. Searching terms with wildcard in their start, is solved with ReversedWildcardFilterFactory. But, what about terms with wildcard in both start AND end? 
This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and to drop the query if too heavy wildcard are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. These two options require an analysis of the query text, which might be an ugly work (just think about nested queries [using _query_], OR even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. if no simple solution exists, timeAllowed limit is the best work-around I could think about. Any other suggestions?
Re: Prevention of heavy wildcard queries
I don't want to affect the (correctness of the) real query parsing, so creating a QParserPlugin is risky. Instead, if I parse the query in my search component, it will be detached from the real query parsing (obviously this causes double parsing, but assume it's OK)... On Tue, May 28, 2013 at 3:52 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Isaac, it is as you say, with the exception that you create a QParserPlugin, not a search component * create QParserPlugin, give it some name, eg. 'nw' * make a copy of the pipeline - your component should be at the same place, or just above, the wildcard processor also make sure you are setting your qparser for FQ queries, ie. fq={!nw}foo On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Thanks Roman. Based on some of your suggestions, will the steps below do the work? * Create (and register) a new SearchComponent * In its prepare method: Do for Q and all of the FQs (so this SearchComponent should run AFTER QueryComponent, in order to see all of the FQs) * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser, with a special implementation of QueryNodeProcessorPipeline, which contains my NodeProcessor at the top of its list. * Set my analyzer into that StandardQueryParser * My NodeProcessor will be called for each term in the query, so it can throw an exception if a (basic) querynode contains a wildcard at both the start and end of the term. Do I have a way to avoid reimplementing the whole StandardQueryParser class? you can try subclassing it, if it allows it Will this work for both LuceneQParser and EdismaxQParser queries? this will not work for edismax, nothing but changing the edismax qparser will do the trick Any other solution/work-around? How do other production environments of Solr overcome this issue?
you can also try modifying the standard solr parser, or even the JavaCC generated classes I believe many people do just that (or some sort of preprocessing) roman On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com wrote: You are right that starting to parse the query before the query component can get soon very ugly and complicated. You should take advantage of the flex parser, it is already in lucene contrib - but if you are interested in the better version, look at https://issues.apache.org/jira/browse/LUCENE-5014 The way you can solve this is: 1. use the standard syntax grammar (which allows *foo*) 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or raise error etc this way, you are changing semantics - but don't need to touch the syntax definition; of course, you may also change the grammar and allow only one instance of wildcard (or some combination) but for that you should probably use LUCENE-5014 roman On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. Searching terms with wildcard in their start, is solved with ReversedWildcardFilterFactory. But, what about terms with wildcard in both start AND end? This query is heavy, and I want to disallow such queries from my users. I'm looking for a way to cause these queries to fail. I guess there is no built-in support for my need, so it is OK to write a new solution. My current plan is to create a search component (which will run before QueryComponent). It should analyze the query string, and to drop the query if too heavy wildcard are found. Another option is to create a query parser, which wraps the current (specified or default) qparser, and does the same work as above. These two options require an analysis of the query text, which might be an ugly work (just think about nested queries [using _query_], OR even a lot of more basic scenarios like quoted terms, etc.) Am I missing a simple and clean way to do this? What would you do? P.S. 
if no simple solution exists, timeAllowed limit is the best work-around I could think about. Any other suggestions?
Re: SurroundQParser does not analyze the query text
Thank you Erik and Jack. I opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-4834 I hope I will have time to submit a patch file soon. On Fri, May 17, 2013 at 7:38 AM, Jack Krupansky j...@basetechnology.com wrote: (Erik: Or he can get the LucidWorks Search product and then use near and before operators so that he doesn't need the surround query parser!) -- Jack Krupansky -Original Message- From: Erik Hatcher Sent: Thursday, May 16, 2013 6:11 PM To: solr-user@lucene.apache.org Subject: Re: SurroundQParser does not analyze the query text The issue can certainly be solved. But to me, it's actually a bit of a feature by design for the Lucene-level surround query parser to not do analysis, as it seems to have been meant for advanced query writers to piece together sophisticated SpanQuery-based pattern matching kinds of things utilizing their knowledge of how text was analyzed and indexed. But for sure it could be modified to do analysis, probably using the multiterm analyzer feature that's in there elsewhere now. I looked into this when I did the basic work of integrating the surround query parser, and determined it was a lot of work because it'd need changes in the Lucene level code to leverage analysis, and then glue at the Solr level to be field type aware and savvy. By all means open a JIRA and contribute! Workaround? Client-side calls can be made to analyze text, and the client-side could build up a query expression based on term-by-term (or phrase) analysis results. Maybe that means a prohibitive number of requests to Solr to build up a query in a way that leverages Solr's field type analysis settings, but it is a technologically possible technique maybe worth considering. Erik On May 16, 2013, at 16:38 , Isaac Hebsh wrote: Hi, I'm trying to use the Surround Query Parser for two reasons, which are not covered by proximity slops: 1. find documents with two words within a given distance, *unordered* 2.
given two lists of words, find documents with (at least) one word from list A and (at least) one word from list B, within a given distance. The surround query parser looks great, but it have one big drawback - It does not analyze the query text. It is documented in the [weak :(] wiki page. Can this issue be solved somehow, or it is a bigger constraint? Should I open a JIRA issue for this? Any work-around?
Bloom Filters
Hi everyone.. I'm indexing docs into Solr using the update request handler, by POSTing data to the REST endpoint (not SolrJ, not DIH). My indexer should return an indication of whether the document existed in the collection before or not, based on its ID. The obvious solution is to perform a query before trying to index the document. Do I have any better choice? If the query approach is chosen, I thought that BloomFilters might make this check very efficient. After searching the wiki and JIRA, I found this: http://wiki.apache.org/solr/BloomIndexComponent This JIRA issue is very old, and never managed to get resolved. What effort would be needed in order to get this issue resolved?
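For illustration, the membership test a Bloom filter provides looks like this toy sketch (not the BloomIndexComponent code; class name and hash choices are made up). The key property for the existence check above: "definitely absent" answers are exact, while "present" answers may be false positives, so a confirming query is still needed in that case:

```java
import java.util.BitSet;

/** Toy Bloom filter over document IDs, with two cheap hash functions. */
public class IdBloomFilter {
    private final BitSet bits;
    private final int size;

    public IdBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(String id) { return Math.floorMod(id.hashCode(), size); }
    private int h2(String id) { return Math.floorMod(id.hashCode() * 31 + id.length(), size); }

    /** Record an ID as indexed. */
    public void add(String id) {
        bits.set(h1(id));
        bits.set(h2(id));
    }

    /** false = definitely never indexed; true = probably indexed (may be a false positive). */
    public boolean mightContain(String id) {
        return bits.get(h1(id)) && bits.get(h2(id));
    }
}
```

With such a structure in front of Solr, a query to confirm existence would only be issued when mightContain returns true.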
SurroundQParser does not analyze the query text
Hi, I'm trying to use the Surround Query Parser for two reasons, which are not covered by proximity slops: 1. find documents with two words within a given distance, *unordered* 2. given two lists of words, find documents with (at least) one word from list A and (at least) one word from list B, within a given distance. The surround query parser looks great, but it has one big drawback: it does not analyze the query text. This is documented in the [weak :(] wiki page. Can this issue be solved somehow, or is it a deeper constraint? Should I open a JIRA issue for this? Any work-around?
Re: Basic auth on SolrCloud /admin/* calls
Hi Tim, Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't return any failure message; see the SOLR-4043 issue.) As far as I know, you can't tell Solr to use authentication credentials when communicating with other nodes. It's a bigger issue.. for example, if you want to protect the /update requestHandler, so unauthorized users won't delete your whole collection, it can interfere with the replication process. I think it's a necessary mechanism in a production environment... I'm curious how people use SolrCloud in production w/o it. On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim tvaillanco...@ea.com wrote: Hey guys, I've recently setup basic auth under Jetty 8 for all my Solr 4.x '/admin/*' calls, in order to protect my Collections and Cores API. Although the security constraint is working as expected ('/admin/*' calls require Basic Auth or return 401), when I use the Collections API to create a collection, I receive a 200 OK to the Collections API CREATE call, but the background Cores API calls that are run on the Collections API's behalf fail on the Basic Auth on other nodes with a 401 code, as I should have foreseen, but didn't. Is there a way to tell SolrCloud to use authentication on internal Cores API calls that are spawned on the Collections API's behalf, or is this a new feature request? To reproduce: 1. Implement basic auth on '/admin/*' URIs. 2. Perform a CREATE Collections API call to a node (which will return 200 OK). 3. Notice all Cores API calls fail (Collection isn't created). See stack trace below from the node that was issued the CREATE call.
The stack trace I get is: org.apache.solr.common.SolrException: Server at http://<HOST HERE>:8983/solr returned non ok status:401, message:Unauthorized at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Cheers! Tim
Re: Combining Solr Indexes at SolrCloud
Let's say you have machine A and machine B, and you want to shut down B. If all the shards on B have replicas (on A), you can shut down B instantly. If there is a shard on B that has no replica, you should create one on machine A (using the Core API), let it replicate the whole shard's contents, and then you are safe to shut down B. [Changing the shard count of an existing collection is not possible for now, so MERGing cores is not relevant.] On Fri, Mar 29, 2013 at 11:23 AM, Furkan KAMACI furkankam...@gmail.com wrote: Let's assume that I have two machines in a SolrCloud that work as part of the cloud. If I want to shut down one of them and combine its indexes into the other, how can I do that?
Solr 4.2 - DocValues on id field
Hi, The example schema.xml in Solr 4.2 does not define the id field with docValues=true. Is there any good reason? (other than backward compatibility with indexes from previous versions...) If my common case is fl=id (and no other field), docValues is a classic fit for me. Am I right?
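For reference, the change under discussion is a one-attribute edit to the stock id field definition in schema.xml (this fragment is a sketch, not the shipped example):

```xml
<!-- Sketch: add docValues="true" to the stock id field definition.
     Note this requires a full reindex; an existing index is not upgraded in place. -->
<field name="id" type="string" indexed="true" stored="true" required="true" docValues="true"/>
```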
Any documentation on Solr MBeans?
Hi, I'm trying to monitor some Solr behaviour using JMX. It looks like a great job was done there, but I can't find any documentation on the MBeans themselves. For example, the DirectUpdateHandler2 attributes: what is the difference between adds and cumulative_adds? Does adds count only the last X seconds? Or maybe cumulative_adds survives a core reload?
Re: Timestamp field is changed on update
Hoss Man suggested a wonderful solution for this need: always send update=add for the field you want to keep (if it exists), and use FirstFieldValueUpdateProcessorFactory in the update chain, after DistributedUpdateProcessorFactory (so the atomic update will add the existing field value first, if it exists). This solution exactly covers my case. Thank you! On Wed, Feb 20, 2013 at 11:33 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Nobody responded to my JIRA issue :( Should I commit this patch into SVN's trunk, and set the issue as Resolved? On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Thank you Alex. Atomic Update allows you to add new values into a multivalued field, for example... It means that the original document is being read (using RealTimeGet, which depends on updateLog). There is no reason that the list of operations (add/set/inc) can't include a create-only operation... I think that throwing it back to the client is not a good idea, if only because of the required atomicity (which is handled in the DistributedUpdateProcessor using internal locks). There is no problem when using Atomic Update semantics on a non-existent document. Indeed, it will work on stored fields only. On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Unless it is an Atomic Update, right. In which case Solr/Lucene will actually look at the existing document and - I assume - will preserve whatever field got already populated, as long as it is stored. Should work for default values as well, right? They get populated on first creation, then that document gets partially updated. But I can't tell from the problem description whether it can be reformulated as something that fits Atomic Update. I think if the client does not know whether this is a new record or an updated one, Solr will complain if Atomic Update semantics are used against a non-existent document. Regards, Alex. P.s. Lots of conjecture here; I haven't tested exactly this use-case.
Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood wun...@wunderwood.org wrote: It is a natural part of the update model for Solr (and for many other search engines). Solr does not do updates. It does add, replace, and delete. Every document is processed as if it were new. If there is already a document with that id, then the new document replaces it. The existing documents are not read during indexing. This allows indexing to be much faster than in a relational database. wunder
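The chain Hoss Man suggested can be sketched in solrconfig.xml; the chain name and field name here are illustrative, not from the original thread:

```xml
<!-- Sketch of Hoss's approach (chain/field names are illustrative).
     The client always sends the field with atomic-update semantics, e.g.
     "timestamp": {"add": "..."}. After DistributedUpdateProcessorFactory
     runs, the stored value of an existing document has already been merged
     in ahead of the new one, so keeping only the first value preserves the
     original timestamp. -->
<updateRequestProcessorChain name="keep-first-timestamp">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">timestamp</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```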
update fails if one doc is wrong
Hi. I add documents to Solr by POSTing them to the UpdateHandler, as bulks of add commands (DIH is not used). If one document contains any invalid data (e.g. string data in a numeric field), Solr returns HTTP 400 Bad Request, and the whole bulk fails. I'm searching for a way to tell Solr to accept the rest of the documents... (I'll use RealTimeGet to determine which documents were added). If there is no standard way of doing it, maybe it can be implemented by splitting the add commands into separate HTTP POSTs. Because of using auto-soft-commit, can I say that it is almost equivalent? What is the performance penalty of 100 POST requests (of 1 document each) against 1 request of 100 docs, if a soft commit is eventually done? Thanks in advance...
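The splitting idea can be sketched client-side: build one single-document add payload per doc, so a 400 on one document doesn't reject the rest. The field names below are hypothetical, and the actual HTTP POSTs to /update are not shown:

```python
import json

def split_bulk(docs):
    """Turn one bulk add into single-document JSON payloads, so that one
    invalid document can fail with HTTP 400 without rejecting the batch."""
    return [json.dumps([doc]) for doc in docs]

# Hypothetical bulk: doc "2" has a string in a numeric field and would fail
# on its own, while doc "1" would still be accepted when each payload is
# POSTed separately.
payloads = split_bulk([{"id": "1", "price_i": 10}, {"id": "2", "price_i": "oops"}])
```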
Re: Timestamp field is changed on update
Nobody responded to my JIRA issue :( Should I commit this patch into SVN's trunk, and set the issue as Resolved? On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Thank you Alex. Atomic Update allows you to add new values into a multivalued field, for example... It means that the original document is being read (using RealTimeGet, which depends on updateLog). There is no reason that the list of operations (add/set/inc) can't include a create-only operation... I think that throwing it back to the client is not a good idea, if only because of the required atomicity (which is handled in the DistributedUpdateProcessor using internal locks). There is no problem when using Atomic Update semantics on a non-existent document. Indeed, it will work on stored fields only. On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Unless it is an Atomic Update, right. In which case Solr/Lucene will actually look at the existing document and - I assume - will preserve whatever field got already populated, as long as it is stored. Should work for default values as well, right? They get populated on first creation, then that document gets partially updated. But I can't tell from the problem description whether it can be reformulated as something that fits Atomic Update. I think if the client does not know whether this is a new record or an updated one, Solr will complain if Atomic Update semantics are used against a non-existent document. Regards, Alex. P.s. Lots of conjecture here; I haven't tested exactly this use-case. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood wun...@wunderwood.org wrote: It is a natural part of the update model for Solr (and for many other search engines).
Solr does not do updates. It does add, replace, and delete. Every document is processed as if it was new. If there is already a document with that id, then the new document replaces it. The existing documents are not read during indexing. This allows indexing to be much faster than in a relational database. wunder
Re: Timestamp field is changed on update
Thank you Alex. Atomic Update allows you to add new values into a multivalued field, for example... It means that the original document is being read (using RealTimeGet, which depends on updateLog). There is no reason that the list of operations (add/set/inc) can't include a create-only operation... I think that throwing it back to the client is not a good idea, if only because of the required atomicity (which is handled in the DistributedUpdateProcessor using internal locks). There is no problem when using Atomic Update semantics on a non-existent document. Indeed, it will work on stored fields only. On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Unless it is an Atomic Update, right. In which case Solr/Lucene will actually look at the existing document and - I assume - will preserve whatever field got already populated, as long as it is stored. Should work for default values as well, right? They get populated on first creation, then that document gets partially updated. But I can't tell from the problem description whether it can be reformulated as something that fits Atomic Update. I think if the client does not know whether this is a new record or an updated one, Solr will complain if Atomic Update semantics are used against a non-existent document. Regards, Alex. P.s. Lots of conjecture here; I haven't tested exactly this use-case. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood wun...@wunderwood.org wrote: It is a natural part of the update model for Solr (and for many other search engines). Solr does not do updates. It does add, replace, and delete. Every document is processed as if it were new. If there is already a document with that id, then the new document replaces it.
The existing documents are not read during indexing. This allows indexing to be much faster than in a relational database. wunder
Re: Timestamp field is changed on update
I opened a JIRA issue for this improvement request (attached a patch to DistributedUpdateProcessor). It's my first JIRA issue; please review it... (Or, if someone has an easier solution, tell us...) https://issues.apache.org/jira/browse/SOLR-4468 On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. I have a 'timestamp' field, which is a date, with a default value of 'NOW'. I want it to represent the datetime when the item was inserted (for the first time). Unfortunately, when the item is updated, the timestamp is changed... How can I implement INSERT TIME automatically?
Re: Timestamp field is changed on update
Hi, I do have an externally-created timestamp, but some minutes may pass before it is sent to Solr. On Sat, Feb 16, 2013 at 10:39 PM, Walter Underwood wun...@wunderwood.org wrote: Do you really want the time that Solr first saw it, or do you want the time that the document was really created in the system? I think an external create timestamp would be a lot more useful. wunder On Feb 16, 2013, at 12:37 PM, Isaac Hebsh wrote: I opened a JIRA issue for this improvement request (attached a patch to DistributedUpdateProcessor). It's my first JIRA issue; please review it... (Or, if someone has an easier solution, tell us...) https://issues.apache.org/jira/browse/SOLR-4468 On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. I have a 'timestamp' field, which is a date, with a default value of 'NOW'. I want it to represent the datetime when the item was inserted (for the first time). Unfortunately, when the item is updated, the timestamp is changed... How can I implement INSERT TIME automatically?
Re: Timestamp field is changed on update
The component that sends the document does not know whether it is a new document or an update. These are my internal constraints... But, guys, I think that it's a basic feature, and it would be better if Solr supported it without external help... On Sun, Feb 17, 2013 at 12:37 AM, Upayavira u...@odoko.co.uk wrote: I think what Walter means is: make the thing that sends it to Solr set the timestamp when it does so. Upayavira On Sat, Feb 16, 2013, at 08:56 PM, Isaac Hebsh wrote: Hi, I do have an externally-created timestamp, but some minutes may pass before it is sent to Solr. On Sat, Feb 16, 2013 at 10:39 PM, Walter Underwood wun...@wunderwood.org wrote: Do you really want the time that Solr first saw it, or do you want the time that the document was really created in the system? I think an external create timestamp would be a lot more useful. wunder On Feb 16, 2013, at 12:37 PM, Isaac Hebsh wrote: I opened a JIRA issue for this improvement request (attached a patch to DistributedUpdateProcessor). It's my first JIRA issue; please review it... (Or, if someone has an easier solution, tell us...) https://issues.apache.org/jira/browse/SOLR-4468 On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. I have a 'timestamp' field, which is a date, with a default value of 'NOW'. I want it to represent the datetime when the item was inserted (for the first time). Unfortunately, when the item is updated, the timestamp is changed... How can I implement INSERT TIME automatically?
Re: How to limit queries to specific IDs
Thank you, Erick! Three great answers! On Wed, Feb 13, 2013 at 4:20 AM, Erick Erickson erickerick...@gmail.com wrote: First, it may not be a problem, assuming your other filter queries are more frequent. Second, the easiest way to keep these out of the filter cache would be just to include them as a MUST clause, like +(original query) +id:(1 2 3 4). Third possibility, see https://issues.apache.org/jira/browse/SOLR-2429, but the short form is: fq={!cache=false}restoffq On Mon, Feb 11, 2013 at 2:41 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi everyone. I have queries that should be bound to a set of IDs (the uniqueKey field of my schema). My client front-end sends two Solr requests: In the first one, it wants to get the top X IDs. This result should return very fast. No time to waste on highlighting; this is a very standard query. In the second one, it wants to get the highlighting info (corresponding to the queried fields and terms, of course) on those documents (maybe as some sequential requests, on small bulks of the full list). These two requests are implemented as almost identical calls, to different requestHandlers. I thought to append a filter query to the second request, id:(1 2 3 4 5). Is this idea good for Solr? If so, my problem is that I don't want these filters to flood my filterCache... Is there any way (even if it involves some coding...) to add a filter query which won't be added to the filterCache (at least, not instead of standard filters)? Notes: 1. It can't be assured that the first query will remain in the queryResultsCache... 2. Consider an index size of 50M documents...
How to limit queries to specific IDs
Hi everyone. I have queries that should be bound to a set of IDs (the uniqueKey field of my schema). My client front-end sends two Solr requests: In the first one, it wants to get the top X IDs. This result should return very fast. No time to waste on highlighting; this is a very standard query. In the second one, it wants to get the highlighting info (corresponding to the queried fields and terms, of course) on those documents (maybe as some sequential requests, on small bulks of the full list). These two requests are implemented as almost identical calls, to different requestHandlers. I thought to append a filter query to the second request, id:(1 2 3 4 5). Is this idea good for Solr? If so, my problem is that I don't want these filters to flood my filterCache... Is there any way (even if it involves some coding...) to add a filter query which won't be added to the filterCache (at least, not instead of standard filters)? Notes: 1. It can't be assured that the first query will remain in the queryResultsCache... 2. Consider an index size of 50M documents...
Re: Trying to understand soft vs hard commit vs transaction log
Shawn, what about 'flush to disk' behaviour with MMapDirectoryFactory? On Fri, Feb 8, 2013 at 11:12 AM, Prakhar Birla prakharbi...@gmail.com wrote: Great explanation Shawn! BTW, soft-committed documents will not be recovered on a JVM crash. On 8 February 2013 13:27, Shawn Heisey s...@elyograg.org wrote: On 2/7/2013 9:29 PM, Alexandre Rafalovitch wrote: Hello, What actually happens when using a soft (as opposed to hard) commit? I understand the somewhat very high-level picture (documents become available faster, but you may lose them on power loss). I don't care about low-level implementation details, but I am trying to understand what is happening at a medium level of detail. For example, what are the stages of a document if we are using all available transaction log, soft commit, and hard commit options? It feels like there are three stages: *) Uncommitted (soft or hard): accessible only via direct real-time get? *) Soft-committed: accessible through all search operations? (but not on disk? but where is it? in memory?) *) Hard-committed: all the same as soft-committed, but it is now on disk. Similarly, in the performance section of the Wiki, it says: A commit (including a soft commit) will free up almost all heap memory - why would a soft commit free up heap memory? I thought it was not flushed to disk. Also, with soft commits and the transaction log enabled, doesn't the transaction log allow replaying/recovering the latest state after a crash? I believe that's what a transaction log does for a database. If not, how does one recover, if at all? And where does openSearcher=false fit into that? Does it cause inconsistent results somehow? I am missing something, but I am not sure what or where. Any pointers in the right direction would be appreciated. Let's see if I can answer your questions without giving you incorrect information. New indexed content is not searchable until you open a new searcher, regardless of the type of commit that you do.
A hard commit will close the current transaction log and start a new one. It will also instruct the Directory implementation to flush to disk. If you specify openSearcher=false, then the content that has just been committed will NOT be searchable, as discussed in the previous paragraph. The existing searcher will remain open and continue to serve queries against the same index data. A soft commit does not flush the new content to disk, but it does open a new searcher. I'm sure that the amount of memory available for caching this content is not large, so it's possible that if you do a lot of indexing with soft commits and your hard commits are too infrequent, you'll end up flushing part of the cached data to disk anyway. I'd love to hear from a committer about this, because I could be wrong. There's a caveat with that 'flush to disk' operation -- the default Directory implementation in the Solr example config, which is NRTCachingDirectoryFactory, will cache the last few megabytes of indexed data and not flush it to disk even with a hard commit. If your commits are small, then the net result is similar to a soft commit. If the server or Solr were to crash, the transaction logs would be replayed on Solr startup, recovering that last few megabytes. The transaction log may also recover documents that were soft committed, but I'm not 100% sure about that. To take full advantage of NRT functionality, you can commit as often as you like with soft commits. On some reasonable interval, say every one to fifteen minutes, you can issue a hard commit with openSearcher set to false, to flush things to disk and cycle through transaction logs before they get huge. Solr will keep a few of the transaction logs around, and if they are huge, it can take a long time to replay them. You'll want to choose a hard commit interval that doesn't create giant transaction logs. If any of the info I've given here is wrong, someone should correct me! 
Thanks, Shawn -- Regards, Prakhar Birla +91 9739868086
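Shawn's suggested policy above (soft commit often for visibility, hard commit with openSearcher=false on a longer interval for durability and log rotation) maps to a solrconfig.xml fragment like this; the intervals are illustrative only:

```xml
<!-- Sketch: intervals are illustrative, not recommendations. -->
<autoCommit>
  <maxTime>300000</maxTime>          <!-- hard commit every 5 minutes: flush, rotate tlog -->
  <openSearcher>false</openSearcher> <!-- keep serving from the current searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>          <!-- soft commit every 10 minutes: new searcher, new data visible -->
</autoSoftCommit>
```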
Re: IP Address as number
Small addition: to support queries, I probably have to implement an analyzer (query time)... Can an analyzer be configured on a numeric (i.e. non-text) field? On Thu, Feb 7, 2013 at 6:48 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi. I have to index a field which contains an IP address. Users want to query this field using RANGE queries. To support this, the IP is stored as its DWORD value (assume it is IPv4...). On the other side, users supply the IP addresses textually (xxx.xxx.xxx.xxx). I can write a new field type, extending TrieLongField, which will change the textual representation to a numeric one. But what about stored field retrieval? I want to return the textual form... maybe a search component which changes the stored fields? Has anyone encountered this need before?
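The dotted-quad to DWORD mapping itself is simple; a minimal sketch of both directions (IPv4 only, shown here in Python rather than as the custom Java field type the thread discusses):

```python
import socket
import struct

def ip_to_long(ip: str) -> int:
    """Dotted-quad IPv4 string -> unsigned 32-bit int (network byte order)."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def long_to_ip(n: int) -> str:
    """Unsigned 32-bit int -> dotted-quad IPv4 string, for stored-field display."""
    return socket.inet_ntoa(struct.pack("!I", n))

# Range queries then become plain numeric ranges, e.g.
# ip:[ip_to_long("10.0.0.0") TO ip_to_long("10.0.0.255")]
```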
Re: Servlet Filter for randomizing core names
LBHttpSolrServer is a SolrJ-only feature, isn't it? I think that Solr does not balance queries among cores in the same server. You can claim that it's a non-issue, if a single core can completely serve multiple queries at the same time, and passing requests through different cores adds nothing. I feel that we can achieve some improvement in this case... On Mon, Feb 4, 2013 at 12:45 AM, Shawn Heisey s...@elyograg.org wrote: On 2/3/2013 3:24 PM, Isaac Hebsh wrote: Thanks Shawn for your quick answer. When using the collection name, Solr will choose the leader, when available on the current server (see getCoreByCollection in SolrDispatchFilter). It is clear that it's useful when indexing. But queries should run on replicas too, shouldn't they? Moreover, the core selection seems to be consistent (that is, it will never get the non-first core in a specific arrangement)... Under the assumption that a core does extra work to serve queries (e.g., combining results, processing every non-distributed search component (?)), and the assumption that multithreading works well here, wouldn't utilizing all the cores be useful? Here's an excerpt from the SolrCloud wiki page that suggests it handles load balancing across the cluster automatically: Now send a query to any of the servers to query the cluster: http://localhost:7500/solr/collection1/select?q=*:* Send this query multiple times and observe the logs from the solr servers. You should be able to observe Solr load balancing the requests (done via LBHttpSolrServer?) across replicas, using different servers to satisfy each request. This is near the end of example B. http://wiki.apache.org/solr/SolrCloud#Example_B:_Simple_two_shard_cluster_with_shard_replicas Thanks, Shawn
Re: Servlet Filter for randomizing core names
Of course I did not mean multiple cores of the same shard... A normal SolrCloud configuration: let's say 4 shards, on 4 servers, using replicationFactor=3. Of course, no matter what core was requested, the request will be forwarded to one core of each shard. My question is whether this *first* request should be distributed over all of the cores on a specific server or not. The statement Cores are completely thread safe and can do queries/updates concurrently answers me that there is no reason for my idea. On Mon, Feb 4, 2013 at 9:28 PM, Shawn Heisey s...@elyograg.org wrote: On 2/4/2013 12:06 PM, Isaac Hebsh wrote: LBHttpSolrServer is a SolrJ-only feature, isn't it? I think that Solr does not balance queries among cores in the same server. You can claim that it's a non-issue, if a single core can completely serve multiple queries at the same time, and passing requests through different cores adds nothing. I feel that we can achieve some improvement in this case... If LBHttpSolrServer is used as described in the Wiki (whoever wrote that wasn't sure, they were asking), then it is being used on the server side, not the client. Multiple copies of a shard on the same server is probably not a generally supported config with SolrCloud. It would use more memory and disk space, and I'm not sure that there would be any actual benefit to query speed. Cores are completely thread safe and can do queries/updates concurrently. Whatever concurrency problems exist are likely due to resource (CPU, RAM, I/O) utilization rather than code limitations. If I'm right about that, multiple copies would not solve the problem. Buying a bigger/faster server would be the solution to that problem. Thanks, Shawn
Re: Servlet Filter for randomizing core names
Thanks Shawn for your quick answer. When using the collection name, Solr will choose the leader, when available on the current server (see getCoreByCollection in SolrDispatchFilter). It is clear that it's useful when indexing. But queries should run on replicas too, shouldn't they? Moreover, the core selection seems to be consistent (that is, it will never get the non-first core in a specific arrangement)... Under the assumption that a core does extra work to serve queries (e.g., combining results, processing every non-distributed search component (?)), and the assumption that multithreading works well here, wouldn't utilizing all the cores be useful? On Sun, Feb 3, 2013 at 11:49 PM, Shawn Heisey s...@elyograg.org wrote: On 2/3/2013 1:18 PM, Isaac Hebsh wrote: Hi. I have a SolrCloud cluster, which contains some servers; each server runs multiple cores. I want to distribute the requests over the running cores on each server, without knowing the core names in the client. Question 1: Do I have any reason to do this (when indexing? when querying?). All of these cores are sharing the same system resources, but I guess that I still get better performance if the same amount of requests goes to each core. Am I right? If you are using a cloud-aware API (such as CloudSolrServer from SolrJ), your client knows about your zookeeper setup. Behind the scenes, it consults zookeeper about how to find the various servers and cores. You never have to configure any core names on the client. If you are not using a cloud-aware API, shouldn't you be talking to the collection, not the cores? That is, talk to /solr/test, not /solr/test_shard1_replica1 in your program. That should cause Solr itself to figure out where the cores are and forward requests as necessary. Couple that with a load balancer and it approaches what a cloud-aware API gives you in terms of reliability.
From my attempts to help people in the IRC channel, I have concluded that Solr 4.0 may use the name of the collection as the name of the core on each server. I have not actually used SolrCloud in 4.0, so I cannot say. Solr 4.1 does not do this. If you create a collection named test with 2 shards and 2 replicas with the collections API, you get the following cores distributed among your servers: test_shard1_replica1 test_shard1_replica2 test_shard2_replica1 test_shard2_replica2 Question 2: I've implemented a nice ServletFilter, which replaces the magic name /randomcore/ with a random core name (retrieved from CoreContainer). I'm using RequestDispatcher.forward, on the new URI. It works, very cool :) But, for making it work, I had to add <dispatcher>FORWARD</dispatcher> to the SolrRequestFilter mapping. This setting is explicitly inadvisable in web.xml. Can anyone explain why? No idea here. Thanks, Shawn
Re: Distibuted search
Well, my index is already broken into 16 shards... The behaviour I supposed absolutely doesn't happen... right? Does it make sense somehow as an improvement request? Technically, can multiple Lucene responses be intersected this way? On Mon, Jan 28, 2013 at 9:27 PM, Mingfeng Yang mfy...@wisewindow.com wrote: In your case, since there are no concurrent queries, adding replicas won't help much in improving the response speed. However, breaking your index into a few shards does help increase query performance. I recently broke an index with 30 million documents (30G) into 4 shards, and the boost is pretty impressive (roughly 2-5x faster for a complicated query). Ming On Mon, Jan 28, 2013 at 10:54 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Does adding replicas (on additional servers) help to improve search performance? It is known that each query goes to all the shards. It's clear that if we have massive load, then multiple cores serving the same shard are very useful. But what happens if I'll never have concurrent queries (one query in the system at any time), but I want these single queries to return faster? Will a bigger replication factor contribute? Especially, will a complicated query (with a large number of queried fields) go to multiple cores *of the same shard*? (E.g. core1 searching for term1 in field1, and core2 searching for term2 in field2.) And what about a query on a single field which contains a lot of terms? Thanks in advance..
Re: secure Solr server
You can define a security filter in WEB-INF\web.xml, on specific URL patterns. You might want to set the URL pattern to /admin/*. [find examples here: http://stackoverflow.com/questions/7920092/how-can-i-bypass-security-filter-in-web-xml ] On Sun, Jan 27, 2013 at 8:07 PM, Mingfeng Yang mfy...@wisewindow.com wrote: Before Solr 4.0, I secured Solr by enabling password protection in Jetty. However, password protection makes SolrCloud not work. We use EC2 now, and we need the web admin interface of Solr to be accessible (with a password) from anywhere. How do you protect your Solr server from unauthorized access? Thanks, Ming
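A minimal sketch of the web.xml pieces involved; the role and realm names are illustrative, and the container-side realm setup (users/passwords in Jetty) is not shown:

```xml
<!-- Sketch (role/realm names illustrative): require BASIC auth on /admin/* only,
     leaving /select etc. open so inter-node SolrCloud traffic keeps working. -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr admin</web-resource-name>
    <url-pattern>/admin/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>solr-admin</role-name>
  </auth-constraint>
</security-constraint>
<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>SolrRealm</realm-name>
</login-config>
```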
Re: uniqueKey field type
The id field is not serial; it is generated randomly... so range queries on this field are almost useless. I mentioned TrieField because solr.LongField is internally implemented as a string, while solr.TrieLongField is a number. It might improve performance, even without setting a precisionStep... On Thu, Jan 24, 2013 at 3:31 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, I think trie type fields add value only if you do range queries on them, and it sounds like that is not your use case. Otis Solr ElasticSearch Support http://sematext.com/ On Jan 23, 2013 2:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, In my use case, Solr has to return only the id field as a response to queries. However, it should return 1000 docs at once (rows=1000). My id field is defined as a StrField, due to external systems' constraints. I guess that TrieFields are more efficient than StrFields. *Theoretically*, the field content can be retrieved without loading the stored field. Should I strive for the id to be managed as a number, or does it have no contribution to performance (search retrieval times)? (Yes, I know that Lucene has an internal id mechanism. I think it is not relevant to my question...) - Isaac.
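For reference, the two field type definitions being compared look like this in schema.xml (a sketch of typical definitions; precisionStep="0" indexes a single term per value, with no extra range-precision terms, which fits a key that is never range-queried):

```xml
<!-- Sketch: the current string key vs. a trie-long alternative. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
```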
Re: Solr cache considerations
Wow Erick, the MMap article is a very fundamental one. Totally changed my view. It must be mentioned in SolrPerformanceFactors in the SolrWiki... I'm sorry I did not know it before. Thank you a lot. I promise to share my results when my cart starts to fly :) On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson erickerick...@gmail.com wrote: About your question about the document cache: typically the document cache has a pretty low hit ratio. I've rarely, if ever, seen it get hit very often. And remember that this cache is only hit when assembling the response for a few documents (your page size). Bottom line: I wouldn't worry about this cache much. It's quite useful for processing a particular query faster, but not really intended for cross-query use. Really, I think you're getting the cart before the horse here. Run it up the flagpole and try it. Rely on the OS to do its job (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html). Find a bottleneck _then_ tune. Premature optimization and all that. Several tens of millions of docs isn't that large unless the text fields are enormous. Best Erick On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Ok. Thank you everyone for your helpful answers. I understand that fieldValueCache is not used for resolving queries. Is there any cache that can help this basic scenario (a lot of different queries, on a small set of fields)? Does Lucene's FieldCache help (implicitly)? How can I use RAM to reduce I/O in this type of query? On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: No, the fieldValueCache is not used for resolving queries. Only for multi-token faceting, and apparently for the stats component too. The document cache maintains in memory the stored content of the fields you are retrieving or highlighting on.
It'll hit if the same document matches the query multiple times and the same fields are requested, but as Erick said, it is important for cases when multiple components in the same request need to access the same data. I think soft committing every 10 minutes is totally fine, but you should hard commit more often if you are going to be using the transaction log. openSearcher=false will essentially tell Solr not to open a new searcher after the (hard) commit, so you won't see the newly indexed data and caches won't be flushed. openSearcher=false makes sense when you are using hard commits together with soft commits; as the soft commit is dealing with opening/closing searchers, you don't need hard commits to do it. Tomás On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Unfortunately, it seems ( http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that these caches are not per-segment. In this case, I want to (soft) commit less frequently. Am I right? Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I guess it has a big contribution to standard (not only faceted) query times. The SolrWiki claims that it is primarily used by faceting. What does that say about complex textual queries? documentCache: Erick, after query processing is finished, don't some documents stay in the documentCache? Can't I use it to accelerate queries that should retrieve stored fields of documents? In this case, a big documentCache can hold more documents... About commit frequency: HardCommit: openSearcher=false seems like a nice solution. Where can I read about this? (found nothing but one unexplained sentence in the SolrWiki). SoftCommit: In my case, the required index freshness is 10 minutes. The plan to soft commit every 10 minutes is similar to storing all of the documents in a queue (outside Solr), and indexing a bulk every 10 minutes. Thanks.
On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields.

On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com wrote: filterCache: this is bounded by 1M * (maxDoc) / 8 * (num filters in cache). Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int). fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether
Re: Solr cache considerations
Ok. Thank you everyone for your helpful answers. I understand that fieldValueCache is not used for resolving queries. Is there any cache that can help this basic scenario (a lot of different queries, on a small set of fields)? Does Lucene's FieldCache help (implicitly)? How can I use RAM to reduce I/O in this type of queries?

On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: No, the fieldValueCache is not used for resolving queries. Only for multi-token faceting, and apparently for the stats component too. The document cache maintains in memory the stored content of the fields you are retrieving or highlighting on. It'll hit if the same document matches the query multiple times and the same fields are requested, but as Erick said, it is important for cases when multiple components in the same request need to access the same data. I think soft committing every 10 minutes is totally fine, but you should hard commit more often if you are going to be using the transaction log. openSearcher=false will essentially tell Solr not to open a new searcher after the (hard) commit, so you won't see the new indexed data and caches won't be flushed. openSearcher=false makes sense when you are using hard commits together with soft commits; as the soft commit is dealing with opening/closing searchers, you don't need hard commits to do it. Tomás

On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com wrote: Unfortunately, it seems (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that these caches are not per-segment. In this case, I want to (soft) commit less frequently. Am I right? Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I guess it has a big contribution to standard (not only faceted) query time. The SolrWiki claims that it is primarily used by faceting. What does that say about complex textual queries?
documentCache: Erick, after query processing is finished, don't some documents stay in the documentCache? Can't I use it to accelerate queries that should retrieve stored fields of documents? In this case, a big documentCache can hold more documents...

About commit frequency: HardCommit: openSearcher=false seems like a nice solution. Where can I read about this? (I found nothing but one unexplained sentence in the SolrWiki.) SoftCommit: in my case, the required index freshness is 10 minutes. The plan to soft commit every 10 minutes is similar to storing all of the documents in a queue (outside Solr), and indexing a bulk every 10 minutes. Thanks.

On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields.

On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com wrote: filterCache: this is bounded by 1M * (maxDoc) / 8 * (num filters in cache). Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int). fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether these are per-segment caches or not. Any per-segment cache is still valid. Think of documentCache as intended to hold the stored fields while various components operate on them, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry.

About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false.
If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less, with softCommits happening more often, maybe every 15 seconds. In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes. The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc. Best, Erick

On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I am going to build a big
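[Editor's note: Erick's suggested schedule (soft commits every 15 seconds, hard commits with openSearcher=false every 5 minutes) maps onto the updateHandler section of solrconfig.xml roughly as below. The interval values come from the advice above; the rest is a sketch of the standard config elements, worth verifying against your Solr version's reference guide:]

```xml
<!-- solrconfig.xml (inside <config>) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>  <!-- transaction log; rolled over on each hard commit -->

  <!-- Hard commit: flush segments and truncate the tlog, but keep the
       current searcher open so caches are not invalidated. -->
  <autoCommit>
    <maxTime>300000</maxTime>            <!-- 5 minutes, in ms -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit: open a new searcher so recent documents become visible. -->
  <autoSoftCommit>
    <maxTime>15000</maxTime>             <!-- 15 seconds, in ms -->
  </autoSoftCommit>
</updateHandler>
```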
Re: Solr cache considerations
Unfortunately, it seems (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that these caches are not per-segment. In this case, I want to (soft) commit less frequently. Am I right? Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I guess it has a big contribution to standard (not only faceted) query time. The SolrWiki claims that it is primarily used by faceting. What does that say about complex textual queries?

documentCache: Erick, after query processing is finished, don't some documents stay in the documentCache? Can't I use it to accelerate queries that should retrieve stored fields of documents? In this case, a big documentCache can hold more documents...

About commit frequency: HardCommit: openSearcher=false seems like a nice solution. Where can I read about this? (I found nothing but one unexplained sentence in the SolrWiki.) SoftCommit: in my case, the required index freshness is 10 minutes. The plan to soft commit every 10 minutes is similar to storing all of the documents in a queue (outside Solr), and indexing a bulk every 10 minutes. Thanks.

On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields.

On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com wrote: filterCache: this is bounded by 1M * (maxDoc) / 8 * (num filters in cache). Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int). fieldValueCache:
I don't think so, although I'm a bit fuzzy on this. It depends on whether these are per-segment caches or not. Any per-segment cache is still valid. Think of documentCache as intended to hold the stored fields while various components operate on them, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry.

About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less, with softCommits happening more often, maybe every 15 seconds. In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes. The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc. Best, Erick

On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I am going to build a big Solr (4.0?) index, which holds some dozens of millions of documents. Each document has some dozens of fields, and one big textual field. The queries on the index are non-trivial and a little bit long (might be hundreds of terms). No query is identical to another. Now, I want to analyze the cache performance (before setting up the whole environment), in order to estimate how much RAM I will need.

filterCache: In my scenario, every query has some filters. Let's say that each filter matches 1M documents, out of 10M. Should the estimated memory usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache?
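[Editor's note: Erick's correction to this estimate is worth making concrete. A cached filter is a bitset over internal Lucene doc IDs, so it costs maxDoc / 8 bytes regardless of uniqueId size; only sparse filters may instead be stored as int doc IDs. A quick sketch, where the 256-entry cache size is an illustrative assumption, not a figure from the thread:]

```python
def filter_cache_upper_bound(max_doc: int, num_filters: int) -> int:
    """Upper bound per the thread: each cached filter is a bitset over
    internal Lucene IDs (including holes for deleted docs), i.e.
    maxDoc / 8 bytes per filter, regardless of uniqueId size."""
    return (max_doc // 8) * num_filters

def sparse_filter_estimate(num_matches: int, num_filters: int) -> int:
    """If only a few docs match, a filter may instead be stored as int
    doc IDs: roughly num_matches * sizeof(int) bytes per filter."""
    return num_matches * 4 * num_filters

# Numbers from the thread (10M docs) with a hypothetical 256-entry cache:
print(filter_cache_upper_bound(10_000_000, 256))  # -> 320000000 (~320 MB)
```

Note that 1M matching documents out of 10M is not sparse, so the bitset bound is the relevant one here; the original 1M * sizeof(uniqueId) estimate does not apply.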
fieldValueCache: Due to the difference between queries, I guess that fieldValueCache is the most important factor in query performance. Here comes a generic question: I'm indexing new documents to the index constantly. Soft commits will be performed every 10 minutes. Does that mean the cache is meaningless after every 10 minutes?

documentCache: enableLazyFieldLoading will be enabled, and fl contains a very small set of fields. BUT, I need to return highlighting on about (possibly) 20 fields. Does the highlighting component use the documentCache? I guess that highlighting requires the whole field to be loaded into the documentCache. Will it happen only for fields that matched a term from the query?

And one more question: I'm planning to hard commit once a day. Should I prepare for significant RAM usage growth between hard commits? (Consider a lot of new documents in this period...) Does this RAM
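[Editor's note: the caches discussed throughout this thread (filterCache, documentCache, fieldValueCache) and the enableLazyFieldLoading flag mentioned above are all declared in the query section of solrconfig.xml. A sketch for orientation; the size and autowarm values are illustrative, not recommendations from the thread:]

```xml
<!-- solrconfig.xml (inside <config>); sizes here are placeholders -->
<query>
  <!-- Bitsets over internal doc IDs; autowarming re-executes cached
       filters against the new searcher after a commit. -->
  <filterCache class="solr.FastLRUCache" size="512"
               initialSize="512" autowarmCount="128"/>

  <!-- Stored fields of recently fetched documents. Keyed by internal
       doc ID, so it cannot be autowarmed across searchers. -->
  <documentCache class="solr.LRUCache" size="1024"
                 initialSize="1024" autowarmCount="0"/>

  <!-- fieldValueCache is created implicitly if not declared here. -->

  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```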