Re: omitTermFreq only?
Sorry, I should have made the objectives clear. The goal is to reduce the index size by avoiding the term frequency stored in the index (in the .frq segment files). After exploring a bit more, I realized that LUCENE-2048 now allows omitting positions. Similarly, I'm looking for an omitFrequency option.

Thanks,
-Jibo

On Jul 13, 2011, at 1:34 PM, Markus Jelsma wrote:

> A dirty hack is to return 1.0f for each tf > 0. Just a couple of lines of code
> for a custom similarity class.
>
>> Hello,
>>
>> I was wondering if there is a way we can omit only the Term Frequency in
>> solr?
>>
>> omitTermFreqAndPositions=true wouldn't work for us since we need the
>> positions for supporting phrase queries.
>>
>> Thanks,
>> -Jibo
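For reference, Markus's hack could look like the sketch below (assuming the Lucene 3.x Similarity API; the class and package names are hypothetical). Note that it only flattens tf for scoring purposes - the frequencies are still written to the .frq files, so it would not shrink the index, which is the goal here.

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch of the "dirty hack": treat any tf > 0 as 1.0 so term frequency
    // stops influencing scores. Positions are untouched, so phrase queries
    // still work. This does NOT remove frequencies from the .frq files.
    public class FlatTermFreqSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
    }

It would be wired in via schema.xml with <similarity class="com.example.FlatTermFreqSimilarity"/>.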
omitTermFreq only?
Hello,

I was wondering if there is a way we can omit only the Term Frequency in Solr?

omitTermFreqAndPositions=true wouldn't work for us since we need the positions for supporting phrase queries.

Thanks,
-Jibo
FieldType for storing date
Hello,

I was wondering what would be the best FieldType for storing a date with millisecond precision that would allow me to sort and run range queries against this field. We would like to achieve the best query performance, minimal heap (fieldcache) requirements, good indexing throughput, and minimal index size, in that order.

To give you some background, we have a production system that runs in a multicore setup, each core with a maximum index size of 6G, and the search and indexing operations occur against the same cores. We store the date with minute precision (format yyMMddHHmm), and we use TrieIntField with precisionStep=1. This works well; however, as a next step, we want to store the date with millisecond precision with minimal architectural changes. We could probably use TrieLongField; however, as we understand it, this doubles the heap requirements for the fieldcache. I was wondering if there is a clever way of achieving this without adding to the heap. Appreciate your input.

Thanks,
-Jibo
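The straightforward option under discussion would look something like the schema.xml sketch below (field and type names are hypothetical), storing epoch milliseconds in a trie long field. The heap concern is real: a fieldcache entry for a long field takes roughly 8 bytes per document versus 4 for an int.

    <!-- schema.xml sketch: epoch milliseconds in a trie long field.
         precisionStep=1 indexes extra precision levels per value, which
         speeds up range queries at the cost of more indexed terms. -->
    <fieldType name="tlong" class="solr.TrieLongField"
               precisionStep="1" omitNorms="true" positionIncrementGap="0"/>
    <field name="timestamp_ms" type="tlong" indexed="true" stored="false"/>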
parsedquery becomes PhraseQuery
Hello,

I have a question on how Solr determines whether the q value needs to be analyzed as a regular query or as a phrase query.

Let's say I have a text field containing 'jibojohn info disk/1.0'. If I query for 'jibojohn info', I get the results. The query is parsed as:

  jibojohn info
  jibojohn info
  +data:jibojohn +data:info
  +data:jibojohn +data:info

However, if I query for 'disk/1.0', I get nothing. The query is parsed as:

  disk/1.0
  disk/1.0
  PhraseQuery(data:"disk 1 0")
  data:"disk 1 0"

I was expecting this to be treated as a regular query, instead of a phrase query. I was wondering why. Appreciate your input.

-Jibo
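What triggers this is the analysis chain, not the query syntax: the field's analyzer splits the single term disk/1.0 into multiple tokens (e.g. a WordDelimiterFilter producing "disk", "1", "0"), and the Lucene query parser wraps any single query term that analyzes to multiple tokens in a PhraseQuery. A minimal sketch of that behavior (Lucene 3.x-era API, with StandardAnalyzer standing in for the field's actual analyzer):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    // A single query term that tokenizes into several tokens is parsed as a
    // PhraseQuery, which then requires the tokens to appear adjacently.
    public class PhraseFromSingleTerm {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser(Version.LUCENE_30, "data",
                    new StandardAnalyzer(Version.LUCENE_30));
            Query q = parser.parse("disk/1.0");
            // Expect a PhraseQuery, printed as something like data:"disk 1.0"
            System.out.println(q.getClass().getSimpleName() + ": " + q);
        }
    }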
Re: Invoke "expungeDeletes" using SolrJ's SolrServer.commit()
Created jira issue https://issues.apache.org/jira/browse/SOLR-1487

Thanks,
-Jibo

On Oct 2, 2009, at 2:17 PM, Shalin Shekhar Mangar wrote:

  On Sat, Oct 3, 2009 at 1:35 AM, Jibo John wrote:

    Hello,

    I know I can invoke expungeDeletes using the update handler (curl update -F stream.body='<commit expungeDeletes="true"/>'), however, I was wondering if it is possible to invoke it using SolrJ. It looks like, currently, there are no SolrServer.commit(..) methods that I can use for this purpose. Any input will be helpful.

  You are right. Please create an issue. We need this in 1.4

  --
  Regards,
  Shalin Shekhar Mangar.
Invoke "expungeDeletes" using SolrJ's SolrServer.commit()
Hello,

I know I can invoke expungeDeletes using the update handler (curl update -F stream.body='<commit expungeDeletes="true"/>'), however, I was wondering if it is possible to invoke it using SolrJ. It looks like, currently, there are no SolrServer.commit(..) methods that I can use for this purpose. Any input will be helpful.

Thanks,
-Jibo
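Until a commit(..) overload exists, one workaround sketch is to push the same raw update XML through SolrJ's DirectXmlRequest (the server URL below is a placeholder):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.DirectXmlRequest;

    // Send the same XML the curl command posts, but via SolrJ.
    public class ExpungeDeletesViaSolrJ {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.request(new DirectXmlRequest("/update",
                    "<commit expungeDeletes=\"true\"/>"));
        }
    }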
Re: Problem changing the default MergePolicy/Scheduler
On Sep 27, 2009, at 9:42 PM, Shalin Shekhar Mangar wrote:

  On Mon, Sep 28, 2009 at 2:59 AM, Jibo John wrote:

    Additionally, I get the same exception even if I declare the <mergePolicy> in the <indexDefaults>.

      <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
        <boolean name="calibrateSizeByDeletes">true</boolean>
      </mergePolicy>

  That should be <bool> instead of <boolean>

Yeah, that was it. Thank you very much.

Thanks,
-Jibo

  --
  Regards,
  Shalin Shekhar Mangar.
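In other words, the working SOLR-1447 setter-injection form would be:

    <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
      <bool name="calibrateSizeByDeletes">true</bool>
    </mergePolicy>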
Re: Problem changing the default MergePolicy/Scheduler
Additionally, I get the same exception even if I declare the <mergePolicy> in the <indexDefaults>.

  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <boolean name="calibrateSizeByDeletes">true</boolean>
  </mergePolicy>

Thanks,
-Jibo

On Sep 27, 2009, at 2:03 PM, Jibo John wrote:

  Thanks for this. I've updated trunk/, rebuilt solr.war; however, I am running into another issue.

    Sep 27, 2009 1:55:44 PM org.apache.solr.common.SolrException log
    SEVERE: java.lang.IllegalArgumentException
        at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:592)
        at org.apache.solr.util.SolrPluginUtils.invokeSetters(SolrPluginUtils.java:989)
        at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:87)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:185)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
        at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)

  Also, the log file has a bunch of these lines:

    Sep 27, 2009 1:55:56 PM org.apache.solr.update.SolrIndexWriter finalize
    SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!

  Here is the snippet from my solrconfig.xml:

    false 6 500 1 1000 1
    <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
      <boolean name="calibrateSizeByDeletes">true</boolean>
    </mergePolicy>
    <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
    <lockType>single</lockType>

  Thanks,
  -Jibo

  On Sep 27, 2009, at 1:43 PM, Shalin Shekhar Mangar wrote:

    On Mon, Sep 28, 2009 at 1:18 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

      On Sat, Sep 26, 2009 at 7:13 AM, Jibo John wrote:

        Hello,

        It looks like Solr is not allowing me to change the default MergePolicy/Scheduler classes. Even if I change the default MergePolicy/Scheduler (LogByteSizeMergePolicy and ConcurrentMergeScheduler) defined in solrconfig.xml to a different one (LogDocMergePolicy and SerialMergeScheduler), my profiler shows the default classes are still being loaded.

        Also, if I use the default LogByteSizeMergePolicy, I can't seem to override 'calibrateSizeByDeletes' to 'true' in solrconfig using the new syntax that was introduced this week (SOLR-1447). I'm using the version checked out from trunk yesterday. Any pointers will be helpful.

      Specifying mergePolicy and mergeScheduler in <indexDefaults> does not work in trunk. If you specify in the <mainIndex> section, it will work. I'll give a patch with a fix.

    This is fixed in trunk now. Thanks!

    --
    Regards,
    Shalin Shekhar Mangar.
Re: Problem changing the default MergePolicy/Scheduler
Thanks for this. I've updated trunk/, rebuilt solr.war; however, I am running into another issue.

  Sep 27, 2009 1:55:44 PM org.apache.solr.common.SolrException log
  SEVERE: java.lang.IllegalArgumentException
      at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:592)
      at org.apache.solr.util.SolrPluginUtils.invokeSetters(SolrPluginUtils.java:989)
      at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:87)
      at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:185)
      at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
      at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
      at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
      at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
      at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)

Also, the log file has a bunch of these lines:

  Sep 27, 2009 1:55:56 PM org.apache.solr.update.SolrIndexWriter finalize
  SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!

Here is the snippet from my solrconfig.xml:

  false 6 500 1 1000 1
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <boolean name="calibrateSizeByDeletes">true</boolean>
  </mergePolicy>
  <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
  <lockType>single</lockType>

Thanks,
-Jibo

On Sep 27, 2009, at 1:43 PM, Shalin Shekhar Mangar wrote:

  On Mon, Sep 28, 2009 at 1:18 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

    On Sat, Sep 26, 2009 at 7:13 AM, Jibo John wrote:

      Hello,

      It looks like Solr is not allowing me to change the default MergePolicy/Scheduler classes. Even if I change the default MergePolicy/Scheduler (LogByteSizeMergePolicy and ConcurrentMergeScheduler) defined in solrconfig.xml to a different one (LogDocMergePolicy and SerialMergeScheduler), my profiler shows the default classes are still being loaded.

      Also, if I use the default LogByteSizeMergePolicy, I can't seem to override 'calibrateSizeByDeletes' to 'true' in solrconfig using the new syntax that was introduced this week (SOLR-1447). I'm using the version checked out from trunk yesterday. Any pointers will be helpful.

    Specifying mergePolicy and mergeScheduler in <indexDefaults> does not work in trunk. If you specify in the <mainIndex> section, it will work. I'll give a patch with a fix.

  This is fixed in trunk now. Thanks!

  --
  Regards,
  Shalin Shekhar Mangar.
Problem changing the default MergePolicy/Scheduler
Hello,

It looks like Solr is not allowing me to change the default MergePolicy/Scheduler classes. Even if I change the default MergePolicy/Scheduler (LogByteSizeMergePolicy and ConcurrentMergeScheduler) defined in solrconfig.xml to a different one (LogDocMergePolicy and SerialMergeScheduler), my profiler shows the default classes are still being loaded.

Also, if I use the default LogByteSizeMergePolicy, I can't seem to override 'calibrateSizeByDeletes' to 'true' in solrconfig using the new syntax that was introduced this week (SOLR-1447).

I'm using the version checked out from trunk yesterday. Any pointers will be helpful.

Thanks,
-Jibo
Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?
On Sep 17, 2009, at 1:30 PM, Shalin Shekhar Mangar wrote:

  On Fri, Sep 18, 2009 at 1:06 AM, Jibo John wrote:

    Hello,

    Came across a Lucene patch (http://issues.apache.org/jira/browse/LUCENE-1634) that would consider the number of deleted documents as the criteria when deciding which segments to merge. Since we expect very frequent deletes, we hope this would help reclaim the space consumed by the deleted documents in a much more efficient way.

    Currently, we can specify a mergepolicy in solrconfig.xml like this:

      <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>

    However, by default, calibrateSizeByDeletes = false in LogMergePolicy. I was wondering if there is a way I can modify calibrateSizeByDeletes just by configuration?

  Alas, no. The only option that I see for you is to sub-class LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in the constructor. However, please open a Jira issue so we don't forget about it.

Created a jira issue: https://issues.apache.org/jira/browse/SOLR-1444

  Also, you might be interested in expungeDeletes, which has been added as a request parameter for commits. Calling commit with expungeDeletes=true will remove all deleted documents from the index, but unlike an optimize it won't always reduce the index to a single segment.

Thanks for this information. Will explore this.

  --
  Regards,
  Shalin Shekhar Mangar.

Thanks,
-Jibo
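Shalin's workaround could look like the sketch below (class name hypothetical; this assumes a Lucene version whose merge policies have a no-argument constructor - on the 2.9-era API the constructor takes the IndexWriter instead):

    import org.apache.lucene.index.LogByteSizeMergePolicy;

    // Subclass the default merge policy and flip calibrateSizeByDeletes,
    // then point solrconfig.xml's mergePolicy at this class.
    public class DeleteCalibratedMergePolicy extends LogByteSizeMergePolicy {
        public DeleteCalibratedMergePolicy() {
            // Size segments by live documents only, so segments full of
            // deletes look smaller and get merged (and reclaimed) sooner.
            setCalibrateSizeByDeletes(true);
        }
    }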
How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?
Hello,

Came across a Lucene patch (http://issues.apache.org/jira/browse/LUCENE-1634) that would consider the number of deleted documents as the criteria when deciding which segments to merge. Since we expect very frequent deletes, we hope this would help reclaim the space consumed by the deleted documents in a much more efficient way.

Currently, we can specify a mergepolicy in solrconfig.xml like this:

  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>

However, by default, calibrateSizeByDeletes = false in LogMergePolicy. I was wondering if there is a way I can modify calibrateSizeByDeletes just by configuration?

Thanks,
-Jibo
Re: Solr 1.4 Replication scheme
Slightly off topic, one question on the index file transfer mechanism used in the new 1.4 replication scheme: is my understanding correct that the transfer is over HTTP? (vs. rsync in the script-based snappuller)

Thanks,
-Jibo

On Aug 14, 2009, at 6:34 AM, Yonik Seeley wrote:

  Longer term, it might be nice to enable clients to specify what version of the index they were searching against. This could be used to prevent consistency issues across different slaves, even if they commit at different times. It could also be used in distributed search to make sure the index didn't change between phases.

  -Yonik
  http://www.lucidimagination.com

  2009/8/14 Noble Paul നോബിള് नोब्ळ्:

    On Fri, Aug 14, 2009 at 2:28 PM, KaktuChakarabati wrote:

      Hey Noble,
      you are right in that this will solve the problem; however, it implicitly assumes that commits to the master are infrequent enough (so that most polling operations yield no update and only every few polls lead to an actual commit). This is a relatively safe assumption in most cases, but one that couples the master update policy with the performance of the slaves - if the master gets updated (and committed to) frequently, slaves might face a commit on every 1-2 polls, much more often than is feasible given new-searcher warmup times.

      In effect, what this comes down to, it seems, is that I must make the master commit frequency the same as I'd want the slaves to use - and this is markedly different from the previous behaviour, with which I could have the master get updated (+committed to) at one rate and the slaves committing those updates at a different rate.

    I see the argument. But isn't it better to keep both the master and slave as consistent as possible? There is no use in committing on the master if you do not plan to search on those docs. So the best thing to do is to commit only as frequently as you wish to commit on a slave.

    On a different track, if we can have an option of disabling commit after replication, is it worth it? So the user can trigger a commit explicitly.

      Noble Paul നോബിള് नोब्ळ्-2 wrote:

        usually the pollInterval is kept to a small value like 10secs. there is no harm in polling more frequently. This can ensure that the replication happens at almost the same time.

        On Fri, Aug 14, 2009 at 1:58 PM, KaktuChakarabati wrote:

          Hey Shalin,
          thanks for your prompt reply. To clarify: with the old script-based replication, I would snappull every x minutes (say, on the order of 5 minutes). Assuming no index optimize occurred (I optimize 1-2 times a day, so we can disregard it for the sake of argument), the snappull would take a few seconds to run on each iteration. I then have a crontab on all slaves that runs snapinstall at fixed times, let's say every 15 minutes from the start of a round hour, inclusive. (Slave machine times are synced, e.g. via ntp.) So essentially all slaves will begin a snapinstall at exactly the same time - and assuming uniform load, and the fact that they all have the same snapshot at this point in time (since I snappull frequently), this leads to fairly synchronized replication across the board.

          With the new replication, however, it seems that by binding the pulling and installing together, as well as specifying the timing in deltas only (as opposed to "absolute-time" based, like in crontab), we've essentially made it impossible to effectively keep multiple slaves up to date and synchronized; e.g. if we set the poll interval to 15 minutes, a slight offset in the startup times of the slaves (which can very much be the case after arbitrary resets/maintenance operations) can lead to deviations in snappull(+install) times. This in turn is made worse by the fact that the pollInterval is computed based on the offset of when the last commit *finished* - and this number seems to have a higher variance, e.g. due to warmup, which might differ across machines based on the queries they've handled previously.

          To summarize, it seems to me like it might be beneficial to introduce a second parameter that acts more like a crontab time-based tableau, insofar as it can enable a user to specify when an actual commit should occur - so we could have the pollInterval set to a low value (e.g. 60 seconds) but then specify to only perform a commit at the 0, 15, 30, and 45-minute marks of every hour. This makes the commit times on the slaves fairly deterministic.

          Does this make sense, or am I missing something with the current in-process replication?

          Thanks,
          -Chak

          Shalin Shekhar Mangar wrote:

            On Fri, Aug 14, 2009 at 8:39 AM, KaktuChakarabati wrote:

              In the old replication, I could snappull with multiple slaves asynchronously but perform the snapinstall on each at the same time (+- epsilon seconds), so that way production load-balanced query serving will always be consistent. With the new system it seems that I have no control over syncing them, but rather it polls every
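For context, the slave-side configuration under discussion is sketched below (host and interval are placeholders). pollInterval only takes a delta (HH:mm:ss), which is exactly the limitation Chak describes - there is no absolute, crontab-like schedule:

    <!-- solrconfig.xml sketch: 1.4 in-process replication on a slave -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <str name="pollInterval">00:15:00</str>
      </lst>
    </requestHandler>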
Re: Storing string field in solr.ExternalFileField type
Thanks for the quick response, Otis.

We have been able to achieve the ratio of 2 with different settings; however, considering the huge volume of data that we need to deal with - 600 GB of data per day, which we need to keep in the index for 3 days - we're looking at all possible ways to reduce the index size further. Will definitely keep exploring the straightforward things and see if we can find a better setting.

Thanks,
-Jibo

On Jul 23, 2009, at 9:49 AM, Otis Gospodnetic wrote:

  I'm not sure there is a lot of benefit from storing the literal values in that external file vs. directly in the index. There are a number of things one should look at first, as far as performance is concerned - JVM settings, cache sizes, analysis, etc. For example, I have one index here that is 9 times the size of the original data because of how its fields are analyzed. I can change one analysis-level setting and make that ratio go down to 2. So I'd look at the other, more straightforward things first. There is a wiki page, either on the Solr or the Lucene wiki, dedicated to various search performance tricks.

  Otis
  --
  Sematext is hiring: http://sematext.com/about/jobs.html?mls
  Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

  ----- Original Message ----
  From: Jibo John
  To: solr-user@lucene.apache.org
  Sent: Thursday, July 23, 2009 12:08:26 PM
  Subject: Re: Storing string field in solr.ExternalFileField type

  Thanks for the response, Erick.

  We have seen that the size of the index has a direct impact on search speed, especially when the index size is in GBs, so we are trying all possible ways to keep the index size as low as we can. We thought the solr.ExternalFileField type would help keep the index size low by storing a text field outside of the index.

  Here's what we were planning: initially, all the fields except the solr.ExternalFileField field will be queried and displayed to the end user. There will be subsequent calls from the UI to pull the solr.ExternalFileField field, which will be loaded in a lazy manner.

  However, we realized that solr.ExternalFileField only supports the float type, whereas the data that we're planning to keep as an external field is a string type.

  Thanks,
  -Jibo

  On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:

    Hoping the experts chime in if I'm wrong, but as far as I know, while storing a field increases the size of an index, it doesn't have much impact on search speed. Which you could pretty easily test by creating the index both ways and firing off some timing queries and comparing. Although it would be time consuming... I believe there's some info on the Lucene wiki about this, but my memory isn't what it used to be.

    Erick

    On Tue, Jul 21, 2009 at 2:42 PM, Jibo John wrote:

      We're in the process of building a log searcher application. In order to reduce the index size to improve query performance, we're exploring the possibility of having:

      1. One field for each log line with 'indexed=true & stored=false' that will be used for searching
      2. Another field for each log line, of type solr.ExternalFileField, that will be used only for display purposes.

      We realized that currently solr.ExternalFileField supports only the float type. Is there a way we can override this to support the string type? Any issues with this approach?

      Any ideas are welcome.

      Thanks,
      -Jibo
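If the large field does end up stored directly in the index, as Otis suggests, lazy field loading keeps it from being read on every query - the stored value is only pulled for documents whose fl actually requests it. A sketch of the relevant solrconfig.xml setting (it lives in the <query> section):

    <!-- Defer reading large stored fields until a response actually asks
         for them, e.g. via a second, detail-view query. -->
    <enableLazyFieldLoading>true</enableLazyFieldLoading>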
Re: Storing string field in solr.ExternalFileField type
Thanks for the response, Erick.

We have seen that the size of the index has a direct impact on search speed, especially when the index size is in GBs, so we are trying all possible ways to keep the index size as low as we can. We thought the solr.ExternalFileField type would help keep the index size low by storing a text field outside of the index.

Here's what we were planning: initially, all the fields except the solr.ExternalFileField field will be queried and displayed to the end user. There will be subsequent calls from the UI to pull the solr.ExternalFileField field, which will be loaded in a lazy manner.

However, we realized that solr.ExternalFileField only supports the float type, whereas the data that we're planning to keep as an external field is a string type.

Thanks,
-Jibo

On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:

  Hoping the experts chime in if I'm wrong, but as far as I know, while storing a field increases the size of an index, it doesn't have much impact on search speed. Which you could pretty easily test by creating the index both ways and firing off some timing queries and comparing. Although it would be time consuming... I believe there's some info on the Lucene wiki about this, but my memory isn't what it used to be.

  Erick

  On Tue, Jul 21, 2009 at 2:42 PM, Jibo John wrote:

    We're in the process of building a log searcher application. In order to reduce the index size to improve query performance, we're exploring the possibility of having:

    1. One field for each log line with 'indexed=true & stored=false' that will be used for searching
    2. Another field for each log line, of type solr.ExternalFileField, that will be used only for display purposes.

    We realized that currently solr.ExternalFileField supports only the float type. Is there a way we can override this to support the string type? Any issues with this approach?

    Any ideas are welcome.

    Thanks,
    -Jibo
Storing string field in solr.ExternalFileField type
We're in the process of building a log searcher application. In order to reduce the index size to improve query performance, we're exploring the possibility of having:

1. One field for each log line with 'indexed=true & stored=false' that will be used for searching
2. Another field for each log line, of type solr.ExternalFileField, that will be used only for display purposes.

We realized that currently solr.ExternalFileField supports only the float type. Is there a way we can override this to support the string type? Any issues with this approach?

Any ideas are welcome.

Thanks,
-Jibo
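For reference, this is roughly how ExternalFileField is declared today (field and type names hypothetical): keys are resolved via keyField, and values are parsed as floats from external_<fieldname> files in the index data directory - there is no string variant.

    <!-- schema.xml sketch: the existing, float-only ExternalFileField -->
    <fieldType name="file" class="solr.ExternalFileField"
               keyField="id" defVal="0" valType="pfloat"/>
    <field name="line_score" type="file"/>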