Re: gzip compression solr 8.4.1
Hi,

we did further tests to see where exactly the problem is. These are our outcomes:

The Content-Length is calculated correctly; a quick test with curl showed this. The problem is that the stream with the gzip data is not fully consumed and afterwards not closed. Using the debugger with a breakpoint at org/apache/solr/common/util/Utils.java:575 shows that it never enters readFully(entity.getContent()), most likely because of how the gzip stream content is wrapped and extracted beforehand. On line org/apache/solr/common/util/Utils.java:582 the consumeQuietly(entity) call should close the stream, but does not because of a silently swallowed exception.

This seems to be the same issue as described in https://issues.apache.org/jira/browse/SOLR-14457

We saw that the problem also happened with correct GZIP responses from jetty, not only with non-GZIP responses as described in the jira issue.

Best,

Johannes

On Thu, Apr 23, 2020 at 09:55 Johannes Siegert <johannes.sieg...@offerista.com> wrote:

> Hi,
>
> we want to use gzip compression between our application and the solr
> server.
>
> We use a standalone solr server version 8.4.1 and the prepackaged jetty
> as application server.
>
> We have enabled the jetty gzip module by adding these two files:
>
> {path_to_solr}/server/modules/gzip.mod (see below the question)
> {path_to_solr}/server/etc/jetty-gzip.xml (see below the question)
>
> Within the application we use an HttpSolrServer that is configured with
> allowCompression=true.
>
> After we had released our application we saw the number of connections
> in the TCP state CLOSE_WAIT rising until the application was not able to
> open new connections.
>
> After a long debugging session we think the problem is that the
> "Content-Length" header returned by jetty is sometimes wrong when gzip
> compression is enabled.
>
> The solrj client uses a ContentLengthInputStream, which uses the
> "Content-Length" header to detect whether all data has been received.
> But the InputStream cannot be fully consumed, because the value of the
> "Content-Length" header is higher than the actual content length.
>
> Usually the method PoolingHttpClientConnectionManager.releaseConnection
> is called after the InputStream has been fully consumed. This frees the
> connection to be reused or to be closed by the application.
>
> Due to the incorrect "Content-Length" header the
> PoolingHttpClientConnectionManager.releaseConnection method is never
> called and the connection stays active. After jetty's connection timeout
> is reached, it closes the connection from the server side and the TCP
> state switches to CLOSE_WAIT. The client never closes the connection, so
> the number of connections in use keeps rising.
>
> Currently we are trying to configure the jetty gzip module to return no
> "Content-Length" header when gzip compression was used. We hope that in
> this case another InputStream implementation is used, one that relies on
> the end of the stream itself to detect when the InputStream has been
> fully consumed.
>
> Do you have any experiences with this problem or any suggestions for us?
>
> Thanks,
>
> Johannes
>
>
> gzip.mod
>
> -
>
> DO NOT EDIT - See:
> https://www.eclipse.org/jetty/documentation/current/startup-modules.html
>
> [description]
> Enable GzipHandler for dynamic gzip compression
> for the entire server.
> [tags]
> handler
>
> [depend]
> server
>
> [xml]
> etc/jetty-gzip.xml
>
> [ini-template]
> ## Minimum content length after which gzip is enabled
> jetty.gzip.minGzipSize=2048
>
> ## Check whether a file with *.gz extension exists
> jetty.gzip.checkGzExists=false
>
> ## Gzip compression level (-1 for default)
> jetty.gzip.compressionLevel=-1
>
> ## User agents for which gzip is disabled
> jetty.gzip.excludedUserAgent=.*MSIE.6\.0.*
>
> -
>
> jetty-gzip.xml
>
> -
>
> <?xml version="1.0"?>
> <!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN"
>   "http://www.eclipse.org/jetty/configure_9_3.dtd">
> <Configure id="Server" class="org.eclipse.jetty.server.Server">
>   <Call name="insertHandler">
>     <Arg>
>       <New id="GzipHandler"
>            class="org.eclipse.jetty.server.handler.gzip.GzipHandler">
>         <Set name="minGzipSize"><Property name="jetty.gzip.minGzipSize"
>           deprecated="gzip.minGzipSize" default="2048" /></Set>
>         <Set name="checkGzExists"><Property name="jetty.gzip.checkGzExists"
>           deprecated="gzip.checkGzExists" default="false" /></Set>
>         <Set name="compressionLevel"><Property name="jetty.gzip.compressionLevel"
>           deprecated="gzip.compressionLevel" default="-1" /></Set>
>       </New>
>     </Arg>
>   </Call>
> </Configure>
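For reference, the size mismatch involved here can be reproduced with plain JDK streams, independent of Solr: Content-Length describes the compressed bytes on the wire, while the consumer reads decompressed bytes until the gzip end marker, and a client that stops reading short of that marker never reaches end-of-stream and never releases its connection. A minimal sketch (class and method names are ours, not Solr's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: the on-the-wire length of a gzip response (what Content-Length
// must describe) differs from the decompressed length the client consumes.
// A client that trusts a wrong Content-Length stops reading too early,
// never hits the gzip end marker, and never releases the connection.
public class GzipLengthDemo {

    static byte[] gzip(byte[] plain) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws Exception {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes(); // reads until the gzip end marker
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] body = "{\"responseHeader\":{\"status\":0}}".repeat(100)
                          .getBytes(StandardCharsets.UTF_8);
        byte[] wire = gzip(body);
        // Content-Length must match wire.length, not body.length.
        System.out.println("wire=" + wire.length
                           + " decompressed=" + gunzip(wire).length);
    }
}
```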
gzip compression solr 8.4.1
Hi,

we want to use gzip compression between our application and the solr server.

We use a standalone solr server version 8.4.1 and the prepackaged jetty as application server. We have enabled the jetty gzip module by adding these two files:

{path_to_solr}/server/modules/gzip.mod (see below the question)
{path_to_solr}/server/etc/jetty-gzip.xml (see below the question)

Within the application we use an HttpSolrServer that is configured with allowCompression=true.

After we had released our application we saw the number of connections in the TCP state CLOSE_WAIT rising until the application was not able to open new connections.

After a long debugging session we think the problem is that the "Content-Length" header returned by jetty is sometimes wrong when gzip compression is enabled. The solrj client uses a ContentLengthInputStream, which uses the "Content-Length" header to detect whether all data has been received. But the InputStream cannot be fully consumed, because the value of the "Content-Length" header is higher than the actual content length.

Usually the method PoolingHttpClientConnectionManager.releaseConnection is called after the InputStream has been fully consumed. This frees the connection to be reused or to be closed by the application. Due to the incorrect "Content-Length" header the PoolingHttpClientConnectionManager.releaseConnection method is never called and the connection stays active. After jetty's connection timeout is reached, it closes the connection from the server side and the TCP state switches to CLOSE_WAIT. The client never closes the connection, so the number of connections in use keeps rising.

Currently we are trying to configure the jetty gzip module to return no "Content-Length" header when gzip compression was used. We hope that in this case another InputStream implementation is used, one that relies on the end of the stream itself to detect when the InputStream has been fully consumed.
Do you have any experiences with this problem or any suggestions for us?

Thanks,

Johannes


gzip.mod

-

DO NOT EDIT - See:
https://www.eclipse.org/jetty/documentation/current/startup-modules.html

[description]
Enable GzipHandler for dynamic gzip compression
for the entire server.

[tags]
handler

[depend]
server

[xml]
etc/jetty-gzip.xml

[ini-template]
## Minimum content length after which gzip is enabled
jetty.gzip.minGzipSize=2048

## Check whether a file with *.gz extension exists
jetty.gzip.checkGzExists=false

## Gzip compression level (-1 for default)
jetty.gzip.compressionLevel=-1

## User agents for which gzip is disabled
jetty.gzip.excludedUserAgent=.*MSIE.6\.0.*

-

jetty-gzip.xml

-

<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN"
  "http://www.eclipse.org/jetty/configure_9_3.dtd">
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <Call name="insertHandler">
    <Arg>
      <New id="GzipHandler"
           class="org.eclipse.jetty.server.handler.gzip.GzipHandler">
        <Set name="minGzipSize"><Property name="jetty.gzip.minGzipSize"
          deprecated="gzip.minGzipSize" default="2048" /></Set>
        <Set name="checkGzExists"><Property name="jetty.gzip.checkGzExists"
          deprecated="gzip.checkGzExists" default="false" /></Set>
        <Set name="compressionLevel"><Property name="jetty.gzip.compressionLevel"
          deprecated="gzip.compressionLevel" default="-1" /></Set>
      </New>
    </Arg>
  </Call>
</Configure>
ManagedFilter for stemming
Hi,

we are using the SnowballPorterFilter to stem our tokens for several languages. Now we want to update the list of protected words via the Solr API. As far as I can see, there are only solutions for the SynonymFilter and the StopFilter, namely the ManagedSynonymFilter and the ManagedStopFilter. Do you know any solution for my problem?

Thanks,

Johannes
Re: optimize cache-hit-ratio of filter- and query-result-cache
Thanks. The statements on http://wiki.apache.org/solr/SolrCaching#showItems are not explicit enough to answer my question.
optimize cache-hit-ratio of filter- and query-result-cache
Hi,

some of my solr indices have a low cache hit ratio.

1. Does the order of the terms within a single filter query have an impact on the filter-cache and query-result-cache hit ratio?

1.1 Example: fq=field1:(2 OR 3 OR 1) vs. fq=field1:(1 OR 2 OR 3), when 1, 2, 3 appear in random order

2. Does the order of the query parameters have an impact on the query-result-cache hit ratio?

2.1 Example: "q=abc&fq=field1:abc&sort=field1 desc&fq=field2:xyz&sort=field2 asc" vs. "q=abc&fq=field1:abc&fq=field2:xyz&sort=field1 desc&sort=field2 asc", when the query parts appear in random order

Thanks!

Johannes
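Filter-cache entries are keyed by the parsed filter query, so textually different but logically identical fq values (as in example 1.1) generally land in different cache slots. If the client canonicalizes term order before sending, the cache sees one filter instead of many. A minimal client-side sketch (the helper is ours, not a SolrJ API):

```java
import java.util.Arrays;

// Sketch: canonicalize the term order inside a simple disjunctive filter
// query so that logically identical fq values map to one cache entry.
public class FqCanonicalizer {

    static String sortFqTerms(String field, String... terms) {
        String[] sorted = terms.clone();
        Arrays.sort(sorted); // fixed, deterministic order
        return field + ":(" + String.join(" OR ", sorted) + ")";
    }

    public static void main(String[] args) {
        // both calls produce the same fq string, so Solr sees one filter
        System.out.println(sortFqTerms("field1", "2", "3", "1"));
        System.out.println(sortFqTerms("field1", "1", "2", "3"));
    }
}
```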
sort by given order
Hi,

I want to sort my documents by a given order. The order is defined by a list of ids. My current solution is:

list of ids: 15, 5, 1, 10, 3

query: q=*:*&fq=(id:((15) OR (5) OR (1) OR (10) OR (3)))&sort=query($idqsort) desc,id asc&idqsort=id:((15^5) OR (5^4) OR (1^3) OR (10^2) OR (3^1))&start=0&rows=5

Do you know another solution to sort by a list of ids?

Thanks!

Johannes
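The boosted idqsort string above can be generated from the id list so the boosts always match the desired order. A small sketch (the helper name is ours):

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch: build the boosted "idqsort" query from an ordered id list, as in
// the query above. The first id gets the highest boost.
public class IdOrderSort {

    static String idSortQuery(List<Integer> ids) {
        StringJoiner sj = new StringJoiner(" OR ", "id:(", ")");
        int boost = ids.size();
        for (int id : ids) {
            sj.add("(" + id + "^" + boost-- + ")");
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        System.out.println(idSortQuery(List.of(15, 5, 1, 10, 3)));
        // id:((15^5) OR (5^4) OR (1^3) OR (10^2) OR (3^1))
    }
}
```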
NGramTokenizer influence to length normalization?
Hi,

does the NGramTokenizer have an influence on length normalization?

Thanks.

Johannes
wrong docFreq while executing query based on uniqueKey-field
Hi,

my solr index (version 4.7.2) has an id field which is defined as the uniqueKey. The index is updated once per hour.

I use the following query to retrieve some documents: "q=id:2^2 id:1^1"

I would expect document(2) always to be ranked before document(1). But after many index updates document(1) is ranked before document(2). With debug=true I could see the problem: document(1) has docFreq=2, while document(2) has docFreq=1.

How can the docFreq of the uniqueKey field be higher than 1? Could anyone explain this behavior to me?

Thanks!

Johannes
Re: default query operator ignored by edismax query parser
Thanks Shawn! In this case I will use operators everywhere.

Johannes

On 25.06.2014 15:09, Shawn Heisey wrote:

> On 6/25/2014 1:05 AM, Johannes Siegert wrote:
>> I have defined the following edismax query parser defaults: mm=100%,
>> defType=edismax, tie=0.01, ps=100, q.alt=*:*, q.op=AND,
>> qf=field1^2.0 field2, rows=10, fl=*
>>
>> My search query looks like: q=(word1 word2) OR (word3 word4)
>>
>> Since I specified AND as default query operator, the query should
>> match documents by ((word1 AND word2) OR (word3 AND word4)), but the
>> query matches documents by ((word1 OR word2) OR (word3 OR word4)).
>> Could anyone explain the behaviour?
>
> I believe that you are running into this bug:
> https://issues.apache.org/jira/browse/SOLR-2649
>
> It's a very old bug, coming up on three years. The workaround is to not
> use boolean operators at all, or to use operators EVERYWHERE so that
> your intent is explicitly described. It is not much of a workaround, but
> it does work.
>
> Thanks,
> Shawn
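Shawn's workaround of spelling out operators everywhere can be automated on the client before the query is sent. A minimal sketch (the helper is ours, not a SolrJ API):

```java
// Sketch of the SOLR-2649 workaround: spell out the operators instead of
// relying on q.op, so the parser cannot silently change the semantics.
public class ExplicitOperators {

    static String andGroup(String words) {
        return "(" + String.join(" AND ", words.trim().split("\\s+")) + ")";
    }

    public static void main(String[] args) {
        // instead of q=(word1 word2) OR (word3 word4):
        String q = andGroup("word1 word2") + " OR " + andGroup("word3 word4");
        System.out.println(q); // (word1 AND word2) OR (word3 AND word4)
    }
}
```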
default query operator ignored by edismax query parser
Hi,

I have defined the following edismax request handler defaults:

<lst name="defaults">
  <str name="mm">100%</str>
  <str name="defType">edismax</str>
  <float name="tie">0.01</float>
  <int name="ps">100</int>
  <str name="q.alt">*:*</str>
  <str name="q.op">AND</str>
  <str name="qf">field1^2.0 field2</str>
  <int name="rows">10</int>
  <str name="fl">*</str>
</lst>

My search query looks like: q=(word1 word2) OR (word3 word4)

Since I specified AND as default query operator, the query should match documents by ((word1 AND word2) OR (word3 AND word4)), but the query matches documents by ((word1 OR word2) OR (word3 OR word4)). Could anyone explain the behaviour?

Thanks!

Johannes

P.S. The query q=(word1 word2) matches documents by (word1 AND word2).
Bug within the solr query parser (version 4.7.1)
Hi,

I have updated my solr instance from 4.5.1 to 4.7.1. Now the parsed query seems to be incorrect.

Query: q=*:*&fq=title:T&E&debug=true

Before the update the parsed filter query is "+title:t&e +title:t +title:e". After the update the parsed filter query is "+((title:t&e title:t)/no_coord) +title:e". It seems like a bug in the query parser. I have also validated the parsed filter query with the analysis component; the result there was "+title:t&e +title:t +title:e". The behavior is the same for all special characters that split a word into two parts.

I use the following WordDelimiterFilter on the query side:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
        preserveOriginal="1"/>

Thanks.

Johannes

Additional information:

Debug before the update:

rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: (title:((T&E)))
parsed_filter_queries: +title:t&e +title:t +title:e
...

Debug after the update:

rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: (title:((T&E)))
parsed_filter_queries: +((title:t&e title:t)/no_coord) +title:e
...

"title"-field definition:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="..."/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1" stemEnglishPossessive="0"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="..."/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
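The token stream behind the old parsed filter query can be imitated in a few lines: with preserveOriginal="1" and generateWordParts="1", "T&E" yields the lowercased original token plus its sub-words, which matches the three clauses seen before the update. This is a simplified imitation, not the Lucene implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of what a WordDelimiterFilter with preserveOriginal="1" and
// generateWordParts="1" emits for "T&E" after lowercasing: the original
// token plus the sub-words (title:t&e, title:t, title:e).
public class WordDelimiterSketch {

    static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        String lower = token.toLowerCase();
        out.add(lower); // preserveOriginal="1"
        for (String part : lower.split("[^a-z0-9]+")) {
            if (!part.isEmpty()) out.add(part); // generateWordParts="1"
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("T&E")); // [t&e, t, e]
    }
}
```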
changed query behavior
Hi,

I have updated my solr instance from 4.5.1 to 4.7.1. Now my solr query is failing some tests.

Query: q=*:*&fq=(title:((T&E)))&debug=true

Before the update:

rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: (title:((T&E)))
parsed_filter_queries: +title:t&e +title:t +title:e
...

After the update:

rawquerystring: *:*
querystring: *:*
parsedquery: MatchAllDocsQuery(*:*)
parsedquery_toString: *:*
QParser: LuceneQParser
filter_queries: (title:((T&E)))
parsed_filter_queries: +((title:t&e title:t)/no_coord) +title:e
...

Before the update the query delivered only one result. Now the query delivers three results. Do you have any idea why parsed_filter_queries is "+((title:t&e title:t)/no_coord) +title:e" instead of "+title:t&e +title:t +title:e"?

"title"-field definition:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="..."/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1" stemEnglishPossessive="0"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="..."/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>

The default query operator is AND.

Thanks!

Johannes
Re: solr-query with NOT and OR operator
Hi Jack,

thanks! fq=((*:* -(field1:value1)))+OR+(field2:value2) is the solution.

Johannes

On 11.02.2014 17:22, Jack Krupansky wrote:

> With so many parentheses in there, I wonder what you are really trying
> to do. Try expressing your query in simple English first so that we can
> understand your goal.
>
> But generally, a purely negative nested query must have a *:* term to
> apply the exclusion against:
> fq=((*:* -(field1:value1)))+OR+(field2:value2).
>
> -- Jack Krupansky
>
> -----Original Message----- From: Johannes Siegert
> Sent: Tuesday, February 11, 2014 10:57 AM
> To: solr-user@lucene.apache.org
> Subject: solr-query with NOT and OR operator
>
> Hi,
>
> my solr-request contains the following filter-query:
> fq=((-(field1:value1)))+OR+(field2:value2). I expect solr to deliver
> documents matching ((-(field1:value1))) and documents matching
> (field2:value2). But solr delivers only documents that are the result
> of (field2:value2). I do receive several documents if I request only
> ((-(field1:value1))).
>
> Thanks!
>
> Johannes

--
Johannes Siegert
Softwareentwickler

Telefon: 0351 - 418 894 -73
Fax: 0351 - 418 894 -99
E-Mail: johannes.sieg...@marktjagd.de
Xing: https://www.xing.com/profile/Johannes_Siegert2
Webseite: http://www.marktjagd.de
Blog: http://blog.marktjagd.de
Facebook: http://www.facebook.com/marktjagd
Twitter: http://twitter.com/Marktjagd

__

Marktjagd GmbH | Schützenplatz 14 | D - 01067 Dresden
Geschäftsführung: Jan Großmann
Sitz Dresden | Amtsgericht Dresden | HRB 28678
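Jack's rule generalizes: any purely negative clause can be made safe by subtracting it from an explicit match-all query. A minimal sketch (the helper name is ours):

```java
// Sketch: a purely negative clause needs an explicit *:* "match all"
// to subtract from, otherwise it matches nothing on its own.
public class NegativeFilterFix {

    static String makePositive(String negativeClause) {
        return "(*:* " + negativeClause + ")";
    }

    public static void main(String[] args) {
        // instead of fq=((-(field1:value1))) OR (field2:value2):
        String fixed = makePositive("-(field1:value1)")
                       + " OR (field2:value2)";
        System.out.println(fixed); // (*:* -(field1:value1)) OR (field2:value2)
    }
}
```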
solr-query with NOT and OR operator
Hi,

my solr-request contains the following filter-query: fq=((-(field1:value1)))+OR+(field2:value2). I expect solr to deliver documents matching ((-(field1:value1))) and documents matching (field2:value2). But solr delivers only documents that are the result of (field2:value2). I do receive several documents if I request only ((-(field1:value1))).

Thanks!

Johannes
Re: high memory usage with small data set
Hi Erick,

thanks for your reply. What exactly do you mean by "Do your used entries in your caches increase in parallel?"? I update the indices every hour and commit the changes, so a new searcher with empty or autowarmed caches should be created and the old one should be removed.

Johannes

On 30.01.2014 15:08, Erick Erickson wrote:

> Do your used entries in your caches increase in parallel? This would be
> the case if you aren't updating your index and would explain it.
>
> BTW, take a look at your cache statistics (from the admin page) and
> look at the cache hit ratios. If they are very small (and my guess is
> that with 1,500 boolean operations, you aren't getting significant
> re-use) then you're just wasting space, try the cache=false option.
>
> Also, how are you measuring memory? It's sometimes confusing that
> virtual memory can be included, see:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best,
> Erick
>
> On Wed, Jan 29, 2014 at 7:49 AM, Johannes Siegert wrote:
>
>> Hi,
>>
>> we are using Apache Solr Cloud within a production environment. When
>> the maximum heap space is reached, Solr access times slow down because
>> the garbage collector is working for a small amount of time.
>>
>> We use the following configuration:
>>
>> - Apache Tomcat as webserver to run the Solr web application
>> - 13 indices with about 150 entries (300 MB)
>> - 5 servers with one replication per index (5 GB max heap space)
>> - all indices have the following caches:
>>   - maximum document-cache-size is 4096 entries, all other indices
>>     have between 64 and 1536 entries
>>   - maximum query-cache-size is 1024 entries, all other indices have
>>     between 64 and 768
>>   - maximum filter-cache-size is 1536 entries, all other indices have
>>     between 64 and 1024
>> - the directory-factory-implementation is NRTCachingDirectoryFactory
>> - the index is updated once per hour (no auto commit)
>> - ca. 5000 requests per hour per server
>> - large filter queries (up to 15000 bytes and 1500 boolean operations)
>> - many facet queries (30%)
>>
>> Behaviour: we started with 512 MB heap space. Over several days the
>> heap usage grew until the 5 GB were reached. At this moment the
>> described problem occurred. From then on the heap usage stayed between
>> 50 and 90 percent. No OutOfMemoryException occurred.
>>
>> Questions:
>>
>> 1. Why does Solr use 5 GB of RAM with this small amount of data?
>> 2. What impact do the large filter queries have on RAM usage?
>>
>> Thanks!
>>
>> Johannes Siegert
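Erick's cache=false suggestion refers to Solr's local-params syntax: a client can prefix rarely reused giant filters so they bypass the filterCache instead of evicting useful entries. A tiny sketch (the helper name is ours):

```java
// Sketch: mark huge one-off filter queries with the {!cache=false}
// local param so they are not stored in the filterCache.
public class NoCacheFilter {

    static String noCache(String fq) {
        return "{!cache=false}" + fq;
    }

    public static void main(String[] args) {
        System.out.println(noCache("field1:(1 OR 2 OR 3)"));
        // {!cache=false}field1:(1 OR 2 OR 3)
    }
}
```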
high memory usage with small data set
Hi,

we are using Apache Solr Cloud within a production environment. When the maximum heap space is reached, Solr access times slow down because the garbage collector is working for a small amount of time.

We use the following configuration:

- Apache Tomcat as webserver to run the Solr web application
- 13 indices with about 150 entries (300 MB)
- 5 servers with one replication per index (5 GB max heap space)
- all indices have the following caches:
  - maximum document-cache-size is 4096 entries, all other indices have between 64 and 1536 entries
  - maximum query-cache-size is 1024 entries, all other indices have between 64 and 768
  - maximum filter-cache-size is 1536 entries, all other indices have between 64 and 1024
- the directory-factory-implementation is NRTCachingDirectoryFactory
- the index is updated once per hour (no auto commit)
- ca. 5000 requests per hour per server
- large filter queries (up to 15000 bytes and 1500 boolean operations)
- many facet queries (30%)

Behaviour: we started with 512 MB heap space. Over several days the heap usage grew until the 5 GB were reached. At this moment the described problem occurred. From then on the heap usage stayed between 50 and 90 percent. No OutOfMemoryException occurred.

Questions:

1. Why does Solr use 5 GB of RAM with this small amount of data?
2. What impact do the large filter queries have on RAM usage?

Thanks!

Johannes Siegert