Get Recently Added/Updated Documents
Hi,

I have the following scenario:
- there are 2 machines running Solr 4.8.1
- the machines are in different time zones
- the clocks on the machines are not synchronized

An auto-refresh query running every X-2 seconds should return the documents from the last X seconds, with as little performance impact as possible (ideally it should take less than a second).

First of all, I added a first-component that overrides the NOW param set by the main shard, so that each Solr machine uses its own local NOW time. I also added a new custom function recent_docs(ms_since_now(_version_),X)=recip(ms(NOW,_version_ to milliseconds),0.01/X,1,1).

Then I thought about 2 possible solutions, but each one has a disadvantage, and now I am trying to decide which one is optimal. And maybe there are other solutions that I didn't think about.

1. *Solution 1*: use boosting on the _version_ field like this:
   q={!boost b=recent_docs(ms_since_now(_version_),X)}*:*
   1. I use _version_ because I need to receive the recently updated documents, and the time of the document itself shouldn't be changed. And I saw from the code that _version_ is calculated based on the time.
   2. It's good for sorting because all documents are sorted by score, but in this case all documents are matched and I need to return only documents with a score in [0.1, 1]. I could filter by the _version_ field, but I prefer not to, for performance reasons.
   3. *Questions*:
      1. What is the performance impact of such scoring?
      2. *How can I return only documents with a score from 0.1 to 1?*
2. *Solution 2*: use a function range query like this:
   fq={!frange l=0.1 u=1}recent_docs(ms_since_now(_version_),X)
   1. In this case only the relevant documents are returned, but they are not sorted; sorting by _version_ or adding scoring doesn't seem efficient, because then the same function would be calculated twice.
   2. It seems that this function query has a very high performance impact on large cores with hundreds of millions of documents.
   3. *Questions*:
      1. *What is the most optimal way to sort the returned documents without calculating the same function twice?*
      2. What is the performance impact of such a filter query? Is the FieldCache used?
      3. Could it drastically increase the memory consumption of Solr on frequently updated cores with millions of documents?

Any assistance/suggestion/comment will be very much appreciated.

Thank you.

Best regards,
Lyuba
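For reference, a quick numeric sketch of how the recip-based score decays with document age may help in choosing the frange bounds. This is plain arithmetic mirroring Solr's recip(x,m,a,b) = a/(m*x+b) with the constants from the custom function above; treating X as an interval in milliseconds is my assumption:

```python
# Solr's recip(x, m, a, b) computes a / (m*x + b); here m = 0.01/X, a = 1, b = 1,
# where x is the document age in ms and X is the target interval (assumed in ms).
def recent_docs_score(age_ms, interval_ms):
    return 1.0 / ((0.01 / interval_ms) * age_ms + 1.0)

X = 10_000  # hypothetical 10-second interval, in ms

# A document updated just now scores 1.0 ...
print(recent_docs_score(0, X))                     # 1.0
# ... and the score decays to the 0.1 cutoff only at age 900*X ms:
print(round(recent_docs_score(900 * X, X), 3))     # 0.1
```

Note that under these constants a score in [0.1, 1] corresponds to documents updated within the last 900·X ms, not the last X ms, so the multiplier 0.01/X may need tuning if the goal is exactly "the last X seconds".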
adding support for deleteInstanceDir from solrj
Hi all, Did anyone have a chance to look at the code? It's attached here: https://issues.apache.org/jira/browse/SOLR-5023. Thank you very much. Lyuba
Re: [Solr 4.2] deleteInstanceDir is added to CoreAdminHandler but is not supported in Unload CoreAdminRequest
According to the code, at least in Solr 4.2, getParams() of CoreAdminRequest.Unload returns a locally created ModifiableSolrParams. It means that parameters set this way won't reach the CoreAdminHandler. I'm going to open an issue in Jira and provide a patch for this.

Best regards,
Lyuba

On Fri, Jul 5, 2013 at 6:12 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

SolrJ doesn't have explicit support for that param but you can always add it yourself. For example:

  CoreAdminRequest.Unload req = new CoreAdminRequest.Unload(false);
  ((ModifiableSolrParams) req.getParams()).set("deleteInstanceDir", true);
  req.process(server);

On Thu, Jul 4, 2013 at 12:50 PM, Lyuba Romanchuk lyuba.romanc...@gmail.com wrote:

Hi, I need to unload a core and delete the core's instance directory. According to the code of Solr 4.2, I don't see support for this parameter in SolrJ. Is there a fix or an open issue for this? Best regards, Lyuba

--
Regards,
Shalin Shekhar Mangar.
[Solr 4.2] deleteInstanceDir is added to CoreAdminHandler but is not supported in Unload CoreAdminRequest
Hi, I need to unload a core and delete the core's instance directory. According to the code of Solr 4.2, I don't see support for this parameter in SolrJ. Is there a fix or an open issue for this? Best regards, Lyuba
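Until SolrJ grows explicit support, the parameter can be passed at the HTTP level through the Core Admin API. A minimal sketch that only builds the request URL (host and core name are hypothetical; deleteInstanceDir is the parameter CoreAdminHandler accepts):

```python
from urllib.parse import urlencode

def unload_core_url(base_url, core_name, delete_instance_dir=False):
    """Build a Core Admin UNLOAD request URL for the given core."""
    params = {"action": "UNLOAD", "core": core_name}
    if delete_instance_dir:
        # also remove the core's instance directory on disk
        params["deleteInstanceDir"] = "true"
    return f"{base_url}/admin/cores?{urlencode(params)}"

url = unload_core_url("http://localhost:8983/solr", "shard_2013-01-07",
                      delete_instance_dir=True)
print(url)
# http://localhost:8983/solr/admin/cores?action=UNLOAD&core=shard_2013-01-07&deleteInstanceDir=true
```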
Re: [Solr 4.2.1] LotsOfCores - Can't query cores with loadOnStartup=true and transient=true
Hi Erick,

I opened an issue in JIRA: SOLR-4850. But I don't see how to change the assignee; I don't think I have permission to do it.

Thank you.

Best regards,
Lyuba

On Mon, May 20, 2013 at 6:05 PM, Erick Erickson erickerick...@gmail.com wrote:

Lyuba: Could you go ahead and raise a JIRA and assign it to me to investigate? You should definitely be able to define cores this way. Thanks, Erick

On Sun, May 19, 2013 at 9:27 AM, Lyuba Romanchuk lyuba.romanc...@gmail.com wrote:

Hi, It seems that in order to query transient cores they must be defined with loadOnStartup=false. I define one core with loadOnStartup=true and transient=false, the other cores with loadOnStartup=true and transient=true, and transientCacheSize=Integer.MAX_VALUE. In this case CoreContainer.dynamicDescriptors will be empty, and then CoreContainer.getCoreFromAnyList(String) and CoreContainer.getCore(String) return null for all transient cores. I looked at the code of 4.3.0 and it doesn't seem that the flow has changed; the core is added only if it's not loaded on startup. Could you please assist with this issue? Best regards, Lyuba
[Solr 4.2.1] LotsOfCores - Can't query cores with loadOnStartup=true and transient=true
Hi,

It seems that in order to query transient cores they must be defined with loadOnStartup=false. I define one core with loadOnStartup=true and transient=false, the other cores with loadOnStartup=true and transient=true, and transientCacheSize=Integer.MAX_VALUE. In this case CoreContainer.dynamicDescriptors will be empty, and then CoreContainer.getCoreFromAnyList(String) and CoreContainer.getCore(String) return null for all transient cores. I looked at the code of 4.3.0 and it doesn't seem that the flow has changed; the core is added only if it's not loaded on startup.

Could you please assist with this issue?

Best regards,
Lyuba
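For context, this is the shape of the (legacy-style) solr.xml core definitions described above; the names and paths are placeholders, not the actual configuration:

```xml
<solr persistent="true">
  <!-- transientCacheSize bounds how many transient cores stay loaded;
       Integer.MAX_VALUE = 2147483647 -->
  <cores adminPath="/admin/cores" transientCacheSize="2147483647">
    <!-- the always-loaded core: loaded at startup, never swapped out -->
    <core name="main_core" instanceDir="main_core"
          loadOnStartup="true" transient="false"/>
    <!-- the combination that triggers the problem: loadOnStartup AND transient -->
    <core name="shard_2013-01-07" instanceDir="shard_2013-01-07"
          loadOnStartup="true" transient="true"/>
  </cores>
</solr>
```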
Re: Solr 4.0 - timeAllowed in distributed search
Hi Michael,

Thank you very much for your reply! Does it mean that when timeAllowed is used, only the search is interrupted and document retrieval is not?

In order to check the total time of the query I run curl under the Linux time command, so the measurement includes the retrieval of the documents. If I understood your answer correctly I should have gotten a similar total time in both cases, but according to the results the total times are close to the QTimes:
- non-distributed: QTime = 789 ms, total time ~1 sec
- distributed: QTime = 7.75 sec, total time = 7.9 sec

Here is the output of the curls (direct_query.xml and distributed_query.xml contain 30,000 documents in the reply):

Directly ask the shard:

time curl 'http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=3&timeAllowed=500&partialResults=true&debugQuery=true' > direct_query.xml

real    0m1.025s
user    0m0.008s
sys     0m0.053s

from direct_query.xml:

<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">789</int>
  <lst name="params">
    <str name="rows">3</str>
    <str name="q">*:*</str>
    <str name="timeAllowed">500</str>
    <str name="partialResults">true</str>
    <str name="debugQuery">true</str>
  </lst>
</lst>
<result name="response" numFound="28965249" start="0">

Ask the shard through distributed search:

time curl 'http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=3&shards=127.0.0.1%3A8983%2Fsolr%2Fshard_2013-01-07&timeAllowed=500&partialResults=true&shards.info=true&debug=true' > distributed_query.xml

real    0m7.905s
user    0m0.010s
sys     0m0.052s

from distributed_query.xml:

<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">7750</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="debug">true</str>
    <str name="shards">127.0.0.1:8983/solr/shard_2013-01-07</str>
    <str name="partialResults">true</str>
    <str name="shards.info">true</str>
    <str name="rows">3</str>
    <str name="timeAllowed">500</str>
  </lst>
</lst>
<lst name="shards.info">
  <lst name="127.0.0.1:8983/solr/shard_2013-01-07">
    <long name="numFound">28193020</long>
    <float name="maxScore">1.0</float>
    <long name="time">895</long>
  </lst>
</lst>
<result name="response" numFound="28193020" start="0" maxScore="1.0">

Best regards,
Lyuba

On Sun, Jan 20, 2013 at 6:49 PM, Michael Ryan mr...@moreover.com wrote:

(This is based on my knowledge of 3.6 - not sure if this has changed in 4.0.) You are using rows=3, which requires retrieving 3 documents from disk. In a non-distributed search, the QTime will not include the time it takes to retrieve these documents, but in a distributed search, it will. For a *:* query, the document retrieval will almost always be the slowest part of the query. I'd suggest measuring how long it takes for the response to be returned, or use rows=0.

The timeAllowed feature is very misleading. It only applies to a small portion of the query (which in my experience is usually not the part of the query that is actually slow). Do not depend on timeAllowed doing anything useful :)

-Michael

-----Original Message-----
From: Lyuba Romanchuk [mailto:lyuba.romanc...@gmail.com]
Sent: Sunday, January 20, 2013 6:36 AM
To: solr-user@lucene.apache.org
Subject: Solr 4.0 - timeAllowed in distributed search

Hi,

I try to use timeAllowed both in a distributed search with one shard and directly against the same shard. I send the same query with timeAllowed=500:
- directly to the shard: QTime ~= 600 ms
- through distributed search to the same shard: QTime ~= 7 sec

I have two questions:
- It seems that the timeAllowed parameter doesn't work for distributed search, does it?
- What may be the reason that the query through distributed search takes much more time than the query sent directly to the shard (the same difference remains without the timeAllowed parameter in the query)?
Test results:

Ask one shard through distributed search:

http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=3&shards=127.0.0.1%3A8983%2Fsolr%2Fshard_2013-01-07&timeAllowed=500&partialResults=true&shards.info=true&debugQuery=true

<response>
<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">7307</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="shards">127.0.0.1:8983/solr/shard_2013-01-07</str>
    <str name="partialResults">true</str>
    <str name="debugQuery">true</str>
    <str name="shards.info">true</str>
    <str name="rows">3</str>
    <str name="timeAllowed">500</str>
  </lst>
</lst>
<lst name="shards.info">
  <lst name="127.0.0.1:8983/solr/shard_2013-01-07">
    <long name="numFound">29574223</long>
    <float name="maxScore">1.0</float>
    <long name="time">646</long>
  </lst>
</lst>
<result name="response" numFound="29574223" start="0" maxScore="1.0">
... 30,000 docs ...
</result>
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">6141.0</double>
    <lst name="prepare"><double name="time">0.0</double> <lst name=...
[Solr 4.0] what is stored in .tim index file format?
Hi,

I have an index of ~31G where:
- 27% of the index size is .fdt files (8.5G)
- 20% - .fdx files (6.2G)
- 37% - .frq files (11.6G)
- 16% - .tim files (5G)

I didn't manage to find a description of the .tim file format. Can you help me with this?

Thank you.

Best regards,
Lyuba
[Solr 4.0] Is it possible to do soft commit from code and not configuration only
Hi, I need to configure Solr so that the open searcher will see a new document immediately after it is added to the index. And I don't want to perform a commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help. Is there a way to perform a soft commit from code in Solr 4.0? Thank you in advance. Best regards, Lyuba
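For reference, the autoSoftCommit setting being described would look like this in solrconfig.xml (a sketch; maxDocs=1 requests a soft commit after every added document):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- open a new (near-real-time) searcher after every added document -->
  <autoSoftCommit>
    <maxDocs>1</maxDocs>
  </autoSoftCommit>
</updateHandler>
```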
Re: [Solr 4.0] Is it possible to do soft commit from code and not configuration only
Hi Mark,

Thank you for the reply. I tried to normalize the data like in relational databases:
- there are several types of documents, where:
  - documents with the same type have the same fields
  - documents with different types may have different fields
  - but all documents have a type field and a unique key field id
- there is a main type (all records with this type contain pointers to the corresponding records of the other types)

There is a configuration that defines what information should be stored in each type. When I get new data for indexing, first of all I check whether such a document is already in the index, using facets on the corresponding fields and a query on the relevant type. I add documents to the Solr index without committing from the code, but with autoCommit and autoSoftCommit with maxDocs=1 in solrconfig.xml. But there is a problem: if I add a new record for some type, the searcher doesn't see it immediately. As a result I get several equal records with the same type but different ids (the unique key). If I do a commit from code after each document is added it works OK, but that's not a solution. So I wanted to try to do a soft commit from code after adding documents with a non-main type. I searched the wiki docs but found only commit without parameters and commit with parameters that don't seem to be what I need.

Best regards,
Lyuba

On Thu, Apr 12, 2012 at 6:55 PM, Mark Miller markrmil...@gmail.com wrote:

On Apr 12, 2012, at 11:28 AM, Lyuba Romanchuk wrote:

Hi, I need to configure Solr so that the open searcher will see a new document immediately after it is added to the index. And I don't want to perform a commit each time a new document is added. I tried to configure maxDocs=1 under autoSoftCommit in solrconfig.xml but it didn't help.

Can you elaborate on "didn't help"? You couldn't find any docs unless you did an explicit commit? If that is true and there is no user error, this would be a bug.

Is there a way to perform a soft commit from code in Solr 4.0?

Yes - check out the wiki docs - I can't remember how it is offhand (I think it was slightly changed recently).

Thank you in advance. Best regards, Lyuba

- Mark Miller
lucidimagination.com
[Solr 4.0] soft commit with API of Solr 4.0
Hi All,

Is there a way to perform a soft commit from code in Solr 4.0? Or is it possible only from solrconfig.xml, by enabling autoSoftCommit with the maxDocs and/or maxTime attributes?

Thank you in advance.

Best regards,
Lyuba
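Besides solrconfig.xml, a soft commit can also be requested per call by passing softCommit=true on the update request. A small sketch that only builds such a request URL (host and core name are hypothetical):

```python
from urllib.parse import urlencode

def update_url(base_url, soft_commit=False):
    """Build an update-handler URL that asks Solr for a commit."""
    params = {"commit": "true"}
    if soft_commit:
        # ask for a near-real-time soft commit instead of a hard commit
        params["softCommit"] = "true"
    return f"{base_url}/update?{urlencode(params)}"

print(update_url("http://localhost:8983/solr/core1", soft_commit=True))
# http://localhost:8983/solr/core1/update?commit=true&softCommit=true
```

In SolrJ the equivalent is the commit(waitFlush, waitSearcher, softCommit) overload, where the last boolean selects a soft commit.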
[Solr 3.5] Facets and stats become a lot slower during concurrent inserts
Hi,

I am testing facets and stats in Solr 3.5 and I see that queries run a lot slower during inserts into an index with more than 15M documents. If I stop inserting new documents, facet/stats queries run 10-1000 times faster than with concurrent inserts. I don't see this degradation in Lucene.

Could you please explain what may cause this? Is it a Solr-only issue?

Thank you for your help.

Best regards,
Lyuba
Re: [Solr 3.5] Facets and stats become a lot slower during concurrent inserts
autoCommit is disabled in solrconfig.xml and I use SolrServer::addBeans(beans, 100) for the inserts. I need to insert new documents continually at a high rate with queries running concurrently.

Best regards,
Lyuba

On Tue, Dec 27, 2011 at 6:15 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Tue, Dec 27, 2011 at 10:43 AM, Lyuba Romanchuk lyuba.romanc...@gmail.com wrote:

I test facets and stats in Solr 3.5 and I see that queries run a lot slower during inserts into an index with more than 15M documents.

Are you also doing commits (or have autocommit enabled)? The first time a facet command is used for a field after a commit, certain data structures need to be constructed. To avoid slow first requests like this, you can add a request that does the faceting as a static warming query that will be run before any live queries use the new searcher.

-Yonik
http://www.lucidimagination.com
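The static warming query Yonik describes goes into solrconfig.xml as a newSearcher listener. A sketch (the facet field name is a placeholder for whatever fields the queries actually facet on):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- warm the facet data structures before live queries hit the new searcher -->
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">type</str>
    </lst>
  </arr>
</listener>
```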