Boosting documents matching in a specific shard
I am aware that IDF is not distributed across shards. Suppose I want to boost, or rank higher, documents that match in a specific shard: how can I accomplish that? ** This message may contain confidential or proprietary information intended only for the use of the addressee(s) named above or may contain information that is legally privileged. If you are not the intended addressee, or the person responsible for delivering it to the intended addressee, you are hereby notified that reading, disseminating, distributing or copying this message is strictly prohibited. If you have received this message by mistake, please immediately notify us by replying to the message and delete the original message and any copies immediately thereafter. Thank you.- ** FAFLD
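One workaround, not discussed in the thread and purely an assumption here, is to index a marker field identifying which shard each document lives in, then apply a boost query on that field with the dismax/edismax handler. The field name `shard_id` and the boost value are hypothetical:

```xml
<!-- schema.xml sketch: hypothetical marker field, populated at index
     time with an identifier for the shard the document is routed to -->
<field name="shard_id" type="string" indexed="true" stored="true"/>
```

At query time a boost query can then favor documents from one shard without excluding the others, e.g. `...&defType=dismax&bq=shard_id:shard2^10`.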
RE: NGram Indexing Basic Question
Thanks Erick. Actually it was going in as a phrase query. I set the following filter and things are perfect.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, July 20, 2012 5:23 PM
To: solr-user@lucene.apache.org
Subject: Re: NGram Indexing Basic Question

Try attaching &debugQuery=on to your query and look at the parsed query. My first guess is that your default operator is AND (or q.op in modern terms) and the ngram with "dl" in it is required. Please paste the results here if that's not the cause.

Best
Erick

On Thu, Jul 19, 2012 at 7:29 AM, Husain, Yavar wrote:
> I have set some of my fields to be NGram indexed, and have also set the analyzer at both query and index level.
> Most of it works fine except for cases where I simply interchange a couple of characters.
> For example: "springfield" retrieves correct matches, "springfi" retrieves correct matches, "ingfield" retrieves correct matches.
> However, when I say "springfiedl" it returns 0 results. I debugged and found that at query/index level I have all the correct N-grams stored. So ideally it should match "springfie" (which is there in both the query N-grams and the index N-grams) and return the correct results.
> As I was busy I did not get time to look at the code for NGram. What exactly happens when I use NGram at query level? Does it split the string into N-grams and then send each of them to the Solr server?
> Thanks Sahi for your help yesterday. Appreciate that.
NGram Indexing Basic Question
I have set some of my fields to be NGram indexed, and have also set the analyzer at both query and index level. Most of it works fine except for cases where I simply interchange a couple of characters. For example: "springfield" retrieves correct matches, "springfi" retrieves correct matches, "ingfield" retrieves correct matches. However, when I say "springfiedl" it returns 0 results. I debugged and found that at query/index level I have all the correct N-grams stored. So ideally it should match "springfie" (which is there in both the query N-grams and the index N-grams) and return the correct results. As I was busy I did not get time to look at the code for NGram. What exactly happens when I use NGram at query level? Does it split the string into N-grams and then send each of them to the Solr server? Thanks Sahi for your help yesterday. Appreciate that.
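The follow-up in this thread says the problem was that the query side was producing a phrase query; the exact filter the poster added is not shown. A hypothetical fieldType illustrating the common pattern of applying NGram only at index time, so that a substring query such as "springfi" is matched as a single term rather than expanded into a phrase of query-side grams (all names and gram sizes are assumptions, not the poster's actual config):

```xml
<!-- Sketch only: n-grams at index time, plain tokens at query time. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```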
RE: NGram for misspelt words
Thanks Sahi. I have replaced my EdgeNGramFilterFactory with NGramFilterFactory as I need substrings not just at the front or back but anywhere. You are right, I put the same NGramFilterFactory in both the query and index analyzers; however, now it does not return any results, not even the basic ones.

-----Original Message-----
From: Dikchant Sahi [mailto:contacts...@gmail.com]
Sent: Wednesday, July 18, 2012 7:54 PM
To: solr-user@lucene.apache.org
Subject: Re: NGram for misspelt words

You are creating grams only while indexing and not while querying, hence 'ludlwo' would not match. Your analyzer will create the following grams while indexing 'ludlow': lu lud ludl ludlo ludlow, and hence it would not match 'ludlwo'. Either you need to create grams while querying also, or use edit distance.

On Wed, Jul 18, 2012 at 7:43 PM, Husain, Yavar wrote:
> I have configured NGram indexing for some fields.
> Say I search for the city Ludlow, I get results (normal search).
> If I search for Ludlo (with w omitted) I get results.
> If I search for Ludl (with ow omitted) I still get results.
> I know that they are all partial strings of the main string, hence NGram works perfectly.
> But when I type Ludlwo (misspelt, characters o and w interchanged) I don't get any results. It should ideally match "Ludl" and provide the results.
> I am not looking for edit-distance-based spell correctors. How can I make the above NGram-based search work?
> Here is my schema.xml (NGramFieldType):
> [schema snippet mangled in this archive; the surviving fragments show a stored="false" multiValued="true" field and an analyzer using EdgeNGramFilterFactory with maxGramSize="15" side="front"]
NGram for misspelt words
I have configured NGram indexing for some fields. Say I search for the city Ludlow, I get results (normal search). If I search for Ludlo (with w omitted) I get results. If I search for Ludl (with ow omitted) I still get results. I know that they are all partial strings of the main string, hence NGram works perfectly. But when I type Ludlwo (misspelt, characters o and w interchanged) I don't get any results. It should ideally match "Ludl" and provide the results. I am not looking for edit-distance-based spell correctors. How can I make the above NGram-based search work? Here is my schema.xml (NGramFieldType): [schema snippet did not survive in this archive]
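Following Sahi's suggestion in this thread (create grams while querying too), a minimal sketch of such a fieldType; the tokenizer, field-type name, and gram sizes are assumptions, not the poster's lost schema:

```xml
<!-- Sketch only: NGramFilterFactory on both analyzers, so the
     misspelling "ludlwo" still produces grams ("lu", "lud", "ludl", ...)
     that overlap the grams indexed for "ludlow". -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

Note the caveat raised in the other NGram thread: with the default operator set to AND, every query-side gram becomes required, so a gram unique to the misspelling (such as "wo") can still veto the match; q.op=OR is the usual companion setting.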
RE: Solr On Fly Field creation from full text for N-Gram Indexing
Thanks Jack. I tried it (Regex Transformer) out and the indexing has become really slow. Is it (RegexTransformer) slower than N-Gram indexing? They may be apples and oranges, but what I mean is: after extracting the field I want to NGram index it. So it seems going for NGram indexing of the full text (i.e. without extracting what I need using RegexTransformer) is a better solution, ignoring space complexity? Any views? Thanks!

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday, May 10, 2012 4:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr On Fly Field creation from full text for N-Gram Indexing

You can use the "Regex Transformer" to extract from a source field. See: http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
-- Jack Krupansky

-----Original Message-----
From: Husain, Yavar
Sent: Thursday, May 10, 2012 6:04 AM
To: solr-user@lucene.apache.org
Subject: Solr On Fly Field creation from full text for N-Gram Indexing

I have full text in my database and I am indexing it using Solr. At indexing time, can I extract certain parameters based on a regex and create another field/column on the fly for that extracted text? For example, my DB has just 2 columns (DocId & FullText):

DocId | FullText
1     | My name is Avi. RoleId: GYUIOP-MN-1087456. ...

Now, while indexing, I want to extract the RoleId, place it in another column created on the fly, and index that column using N-Gram indexing. I don't want to go for N-Grams of the full text as that would be too expensive in time. Thanks! Any clues would be appreciated.
Solr On Fly Field creation from full text for N-Gram Indexing
I have full text in my database and I am indexing it using Solr. At indexing time, can I extract certain parameters based on a regex and create another field/column on the fly for that extracted text? For example, my DB has just 2 columns (DocId & FullText):

DocId | FullText
1     | My name is Avi. RoleId: GYUIOP-MN-1087456. ...

Now, while indexing, I want to extract the RoleId, place it in another column created on the fly, and index that column using N-Gram indexing. I don't want to go for N-Grams of the full text as that would be too expensive in time. Thanks! Any clues would be appreciated.
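Following Jack's pointer to the RegexTransformer, a hypothetical data-config.xml sketch (the table name, field names, and regex are assumptions based on the example above, not the poster's actual config):

```xml
<!-- Sketch only: DataImportHandler entity that pulls the RoleId out of
     FullText with a regex into a new field, which schema.xml can then
     map to an n-gram-analyzed field type. -->
<entity name="doc" transformer="RegexTransformer"
        query="SELECT DocId, FullText FROM docs">
  <field column="DocId" name="id"/>
  <field column="FullText" name="fulltext"/>
  <!-- regex is an assumption: captures the token following "RoleId:" -->
  <field column="roleId" sourceColName="FullText"
         regex="RoleId:\s*([A-Z0-9-]+)"/>
</entity>
```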
RE: Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).
I am sorry, I should have raised this issue on the Tomcat forums. However, I was just trying my luck here as it was indirectly related to Solr.

From: Husain, Yavar
Sent: Monday, April 23, 2012 11:07 PM
To: solr-user@lucene.apache.org
Subject: Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).

Solr 3.5 was not returning results. To my surprise, Tomcat 6.x (64-bit) was not running on my Windows machine. There were absolutely no errors in the logs, no crash dumps, nothing. I restarted it and everything seems fine now. I went to the Windows Event Viewer and exported the following information relating to Tomcat:

Level       | Date and Time          | Source                  | Event ID | Task Category | Message
Information | 04/23/2012 8:51:58 AM  | Service Control Manager | 7036     | None | The Apache Tomcat 6 service entered the running state.
Error       | 04/23/2012 4:17:12 AM  | Service Control Manager | 7034     | None | The Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).
Information | 04/16/2012 3:13:15 PM  | Service Control Manager | 7036     | None | The Apache Tomcat 6 service entered the running state.
Error       | 04/16/2012 1:12:47 PM  | Service Control Manager | 7034     | None | The Apache Tomcat 6 service terminated unexpectedly. It has done this 1 time(s).
Information | 04/07/2012 10:02:25 PM | Service Control Manager | 7036     | None | The Apache Tomcat 6 service entered the running state.

It is a mystery to me, as I don't have any errors in the Tomcat logs. How should I go about debugging this problem? Any help would be appreciated.
Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).
Solr 3.5 was not returning results. To my surprise, Tomcat 6.x (64-bit) was not running on my Windows machine. There were absolutely no errors in the logs, no crash dumps, nothing. I restarted it and everything seems fine now. I went to the Windows Event Viewer and exported the following information relating to Tomcat:

Level       | Date and Time          | Source                  | Event ID | Task Category | Message
Information | 04/23/2012 8:51:58 AM  | Service Control Manager | 7036     | None | The Apache Tomcat 6 service entered the running state.
Error       | 04/23/2012 4:17:12 AM  | Service Control Manager | 7034     | None | The Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).
Information | 04/16/2012 3:13:15 PM  | Service Control Manager | 7036     | None | The Apache Tomcat 6 service entered the running state.
Error       | 04/16/2012 1:12:47 PM  | Service Control Manager | 7034     | None | The Apache Tomcat 6 service terminated unexpectedly. It has done this 1 time(s).
Information | 04/07/2012 10:02:25 PM | Service Control Manager | 7036     | None | The Apache Tomcat 6 service entered the running state.

It is a mystery to me, as I don't have any errors in the Tomcat logs. How should I go about debugging this problem? Any help would be appreciated.
RE: Regarding Indexing Multiple Columns Best Practise
Thanks Erick!

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, March 16, 2012 6:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Regarding Indexing Multiple Columns Best Practise

I would *guess* you won't notice much/any difference. Note that, if you use a fieldType with an increment gap > 1 (the default is often set to 100), phrase queries (with slop) will perform differently depending upon which option you choose.

Best
Erick

On Thu, Mar 15, 2012 at 10:49 AM, Husain, Yavar wrote:
> Say I have around 30-40 fields (SQL table columns) indexed using Solr from the database. I concatenate those fields into one field by using the Solr copyField directive and then make it the default search field, which I search.
> If, at the database level itself, I perform the concatenation of all those fields into one field and then index that field directly (it would avoid Solr's copy of each field into that concatenated field), will it be an indexing performance improvement? I am sure it will be, but will it make a big/huge change in indexing running time?
> Thanks
Indexing Halts for long time and then restarts
Since Erick is really active answering now, posting a quick question :) I am using: DIH, Solr 3.5 on Windows, building an auto-recommendation utility with around 1 billion query strings (3-6 words each) in the database, indexing them using NGram. Merge factor = 30; autoCommit not set. DIH halted after indexing 7 million documents for around 25 minutes and showed no increment in Total Documents Processed/Fetched; of course it was doing some work, was it some merge activity? After 25 minutes it started moving again. Due to this, indexing time has increased a lot. Any help will be appreciated. Thanks.
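The long stalls described above are consistent with large segment merges. One mitigation often suggested (an assumption here, not something confirmed in this thread) is a lower mergeFactor plus periodic autoCommit, so segments are flushed and merged in smaller steps instead of rare, very large merges. A solrconfig.xml sketch for Solr 3.x, with illustrative values only:

```xml
<!-- Sketch only: values are not tuned for a 1-billion-row workload. -->
<indexDefaults>
  <mergeFactor>10</mergeFactor>          <!-- smaller, more frequent merges -->
  <ramBufferSizeMB>128</ramBufferSizeMB> <!-- flush buffer for indexing -->
</indexDefaults>
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>500000</maxDocs> <!-- commit every 500k docs -->
    <maxTime>600000</maxTime> <!-- or every 10 minutes -->
  </autoCommit>
</updateHandler>
```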
Regarding Indexing Multiple Columns Best Practise
Say I have around 30-40 fields (SQL table columns) indexed using Solr from the database. I concatenate those fields into one field by using the Solr copyField directive and then make it the default search field, which I search. If, at the database level itself, I perform the concatenation of all those fields into one field and then index that field directly (it would avoid Solr's copy of each field into that concatenated field), will it be an indexing performance improvement? I am sure it will be, but will it make a big/huge change in indexing running time? Thanks
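For reference, the copyField approach being weighed looks like this in schema.xml (field names and types are placeholders, with only two of the 30-40 columns shown):

```xml
<!-- Sketch: each source column is indexed individually and also copied
     into one catch-all field that serves as the default search field. -->
<field name="title"    type="text_general" indexed="true" stored="true"/>
<field name="body"     type="text_general" indexed="true" stored="true"/>
<field name="all_text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="all_text"/>
<copyField source="body"  dest="all_text"/>
<!-- Solr 3.x: make the catch-all field the default search field -->
<defaultSearchField>all_text</defaultSearchField>
```

Note Erick's caveat: a positionIncrementGap greater than 1 on the multiValued copy target makes sloppy phrase queries behave differently from a single pre-concatenated column.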
RE: Solr out of memory exception
Thanks a ton.

From: Li Li [fancye...@gmail.com]
Sent: Thursday, March 15, 2012 12:11 PM
To: Husain, Yavar
Cc: solr-user@lucene.apache.org
Subject: Re: Solr out of memory exception

It seems you are using a 64-bit JVM (a 32-bit JVM can only allocate about 1.5GB). You should enable pointer compression with -XX:+UseCompressedOops.

On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar <yhus...@firstam.com> wrote:
Thanks for helping me out. I have allocated Xms=2.0GB, Xmx=2.0GB. However, I see Tomcat is still using pretty little memory, not 2.0GB. Total memory on my Windows machine = 4GB. With a smaller index size it works perfectly fine. I was thinking of increasing the system RAM and the allocated Tomcat heap space, but then how come on a different server with exactly the same system and Solr configuration & memory it works fine?

-----Original Message-----
From: Li Li [mailto:fancye...@gmail.com]
Sent: Thursday, March 15, 2012 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr out of memory exception

How much memory is allocated to the JVM?

On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar <yhus...@firstam.com> wrote:
> Solr is giving an out-of-memory exception. Full indexing completed fine. Later, while searching, maybe when it tries to load the results in memory, it starts giving this exception. However, with the same memory allocated to Tomcat and an exactly identical Solr replica on another server, it works perfectly fine. I am working with 64-bit software, including Java & Tomcat, on Windows. Any help would be appreciated.
> Here are the logs:
> The server encountered an internal error (Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
> If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in null -
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
>   at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>   at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>   at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
>   at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
>   at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
>   at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
>   at org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
>   at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
>   at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
>   at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
>   at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943)
>   at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778)
>   at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504)
>   at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
>   at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
>   at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
>   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
>   at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
>   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
>   at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
>   at org.apache.catalina.core.StandardService.start(StandardService.java:525)
>   at org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
>   at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
>   at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
>   at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:91)
>   at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java: [truncated]
RE: Solr out of memory exception
Thanks for helping me out. I have allocated Xms=2.0GB, Xmx=2.0GB. However, I see Tomcat is still using pretty little memory, not 2.0GB. Total memory on my Windows machine = 4GB. With a smaller index size it works perfectly fine. I was thinking of increasing the system RAM and the allocated Tomcat heap space, but then how come on a different server with exactly the same system and Solr configuration & memory it works fine?

-----Original Message-----
From: Li Li [mailto:fancye...@gmail.com]
Sent: Thursday, March 15, 2012 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr out of memory exception

How much memory is allocated to the JVM?

On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar wrote:
> Solr is giving an out-of-memory exception. Full indexing completed fine. Later, while searching, maybe when it tries to load the results in memory, it starts giving this exception. However, with the same memory allocated to Tomcat and an exactly identical Solr replica on another server, it works perfectly fine. I am working with 64-bit software, including Java & Tomcat, on Windows. Any help would be appreciated.
> Here are the logs:
> The server encountered an internal error (Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
> If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in null -
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
>   at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>   at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>   at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
>   at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
>   at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
>   at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
>   at org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
>   at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
>   at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
>   at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
>   at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943)
>   at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778)
>   at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504)
>   at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
>   at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
>   at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
>   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
>   at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
>   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
>   at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
>   at org.apache.catalina.core.StandardService.start(StandardService.java:525)
>   at org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
>   at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
>   at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
>   at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:91)
>   at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:122)
>   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
>   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:613)
>   at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:104)
>   at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
>   at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
>   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
>   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
>   at org.apa [truncated]
ngram synonyms & dismax together
I have n-gram-indexed 2 fields (columns in the database), and the third one is my full-text field. My default search field is the full-text field, and while querying I use the dismax handler, specifying both the n-grammed fields with certain boost values and the full-text field with a certain boost value. The problem: if I don't use dismax and just search the full-text field (i.e. the default field specified in the schema), synonyms work correctly, i.e. "ca" returns all results where "california" appears. Whereas if I use dismax, "ca" is also searched in the n-grammed fields, returns partial matches of the word "ca", and never goes into the synonym part at all. I want synonyms to apply in every case; how should I go about it?
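Not an answer from the thread, but for context: synonym expansion happens per field, so the dismax qf list determines which field analyzers see the query. One commonly suggested layout (field names and boosts here are hypothetical) keeps the synonym-aware full-text field dominant in qf and demotes the n-gram fields to small boosts so both can contribute:

```xml
<!-- solrconfig.xml sketch: illustrative dismax handler defaults only. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- fulltext's query analyzer would carry the SynonymFilterFactory -->
    <str name="qf">fulltext^10 city_ngram^0.5 name_ngram^0.5</str>
    <str name="mm">1</str>
  </lst>
</requestHandler>
```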
Dismax weird behavior wrt defType
A weird behavior with respect to "defType". Any clues would be appreciated.

Query 1: http://localhost:8085/solr/select/?q=abc&version=2.2&start=0&rows=10&indent=on&defType=dismax ["defType" with capital T: does not fetch results]

Query 2: http://localhost:8085/solr/select/?q=abc&version=2.2&start=0&rows=10&indent=on&deftype=dismax ["deftype" with small t: perfect, results returned]

In the above queries I have removed the boosting part, because of which I am using dismax. And with the boosting stuff in place it is the other way round: "defType" with capital T returns results and with small t it does not.
RE: Spelling Corrector Algorithm
Thanks Robert. Yes, that's right, I can get more accuracy if I use transposition in addition to substitution, insertion and deletion.

From: Robert Muir [rcm...@gmail.com]
Sent: Thursday, March 01, 2012 9:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Spelling Corrector Algorithm

On Thu, Mar 1, 2012 at 6:43 AM, Husain, Yavar wrote:
> Hi
> For the spell-checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell-check algorithm is based on edit distance. For example:
> Query to Solr: Marien
> Spell-check text returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.
> So I chose Market (higher frequency), which is wrong, as my intent was Marine. Both have an edit distance of 2.

If you don't mind using trunk, just use DirectSpellChecker, since it counts marine as an edit distance of 1 from marien (a transposition: https://issues.apache.org/jira/browse/LUCENE-3662)

-- lucidimagination.com
RE: Spelling Corrector Algorithm
Thanks James. I loved the last line in your mail: "But in the end, especially with 1-word queries, I doubt even the best algorithms are going to always accurately guess what the user wanted." I absolutely agree with this; if it is a phrase (instead of a single word) then we can probably apply some NLP techniques.

From: Dyer, James [james.d...@ingrambook.com]
Sent: Thursday, March 01, 2012 9:29 PM
To: solr-user@lucene.apache.org
Subject: RE: Spelling Corrector Algorithm

Yavar,

When you listed what the spell checker returns you put them in this order:

> Marine (Freq: 120), Market (Freq: 900) and others

Was "Marine" listed first, and did you then pick "Market" because you thought higher frequency is better? If so, you probably have the right settings already but need to trust it and go with the first result. If, on the other hand, the wrong suggestions truly are coming up first, you have 2 extension points:

1. You can change the comparator class. The default one sorts by "score" (distance) first and then "frequency" to break ties. There is also a pre-packaged comparator that sorts just on frequency, or you can write your own (implementing Comparator). But I doubt you'd want to change this one. (See http://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29 for more info.)

2. You can change the distance metric. The default uses Levenshtein distance, but there is also an implementation for Jaro-Winkler distance. (See the Wikipedia articles for these 2 if you want to know the subtle differences.) It almost seems to me that Jaro-Winkler might give you better results, but you'd have to test. See the example under http://wiki.apache.org/solr/SpellCheckComponent?highlight=%28distanceMeasure%29#Configuration for more information on how to configure this.
If neither distance measure works for you, you could try implementing your own by creating a class implementing "org.apache.lucene.search.spell.StringDistance", then specifying your class for the "distanceMeasure" parameter.

Finally, there are some other possibly easy solutions to your problem you should test before going through the trouble of writing custom code:

1. Try a higher "spellcheck.count". Even if you only want a couple of results, the algorithm works better with this set >5 (10-20 might be optimal in some cases).
2. Use DirectSolrSpellChecker, if on 4.x. This one is not influenced by "spellcheck.count", so you can truly set it to 1 if all you want is 1 result.
3. Use "spellcheck.collate=true" and set "spellcheck.maxCollationTries" to maybe 5 or 10. This will try the various suggestions by querying the index along with any other query parameters (other keywords, filters, etc), letting you know which suggestions are going to truly return hits in context (and how many).
4. Try Jaro-Winkler (as mentioned above).

Hope this helps. But in the end, especially with 1-word queries, I doubt even the best algorithms are going to always accurately guess what the user wanted.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Husain, Yavar [mailto:yhus...@firstam.com]
Sent: Thursday, March 01, 2012 5:43 AM
To: solr-user@lucene.apache.org
Subject: Spelling Corrector Algorithm

Hi

For the spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm is based on edit distance. For example:

Query to Solr: Marien

Spell check text returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.

So I chose Market (higher frequency), which is wrong, as my intent was marine. Both have an edit distance of 2.
Now how can I improve this algorithm to select marine instead of market (based on something more than edit distance and frequency)? Do I have to incorporate some "soundex" algorithms too? I am looking for something simple which I can implement quickly. I even tried using Peter Norvig's spell corrector algorithm (which is great) but ran into the same problems.
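The distance-metric swap James describes as option 2 is configured per spellchecker in solrconfig.xml. A rough sketch, assuming the wiki-documented parameter names; the component name and source field here are hypothetical:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <!-- Swap the default Levenshtein measure for Jaro-Winkler -->
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
  </lst>
</searchComponent>
```

A custom StringDistance implementation would be plugged in the same way, by putting its fully qualified class name in "distanceMeasure".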
Spelling Corrector Algorithm
Hi

For the spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm is based on edit distance. For example:

Query to Solr: Marien

Spell check text returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.

So I chose Market (higher frequency), which is wrong, as my intent was marine. Both have an edit distance of 2. Now how can I improve this algorithm to select marine instead of market (based on something more than edit distance and frequency)? Do I have to incorporate some "soundex" algorithms too? I am looking for something simple which I can implement quickly. I even tried using Peter Norvig's spell corrector algorithm (which is great) but ran into the same problems.
Solr Basic Performance Test with duplicated data
Will testing Solr with duplicated data in the database produce the same performance statistics as testing Solr with completely unique data? By test I mean routine performance tests like time to index, time to search, etc. Will Solr perform any kind of optimization that results in different statistics for duplicated data versus unique data?
Solr Indexing Running Time 32bit vs 64bit
I was running 32-bit Java (JDK, JRE & Tomcat) on my 64-bit Windows. For indexing I was not able to allocate more than 1.5 GB of heap space on my machine. Each time, my Tomcat process would hit that upper bound (1.5 GB) very quickly, so I decided to move to 64-bit Java/Tomcat. Now I don't see any substantial change in the indexing running time (though I have increased the heap size); it still takes the same time it took on the 32-bit software. The only difference is that the memory consumed by Tomcat is now extremely low, i.e. it no longer touches the 1.5 GB mark. What may be the reason for this?
Solr Tomcat Maximum Heap Memory
I know this is a Solr forum, however my problem is related to Solr running on Tomcat on 64-bit Windows. I am running a 32-bit JVM on a 64-bit Windows 2008 Server. The max heap space I am able to allocate is around 1.5 GB, though I have 10 GB of RAM on my system and there is no other process running. I understand the limit of roughly 2 GB of heap space that can be allocated to a 32-bit process on Windows. However, I have seen people in the forums state they use Xmx up to 10G. How is this possible? If I move to Linux, can I get more heap space allocated to the process, or is it related to the JVM? Simply put, how can I allocate at least 8 GB of RAM as Xmx to Tomcat on my 64-bit Windows? Tomcat crashes when I start it with a larger heap. Please help.
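For reference, a heap beyond ~1.5 GB on Windows requires a 64-bit JDK; the 32-bit JVM cannot address more regardless of installed RAM. On a 64-bit JVM the heap is typically raised through Tomcat's setenv script. A sketch only; the JDK path and heap sizes below are hypothetical:

```bat
rem %CATALINA_HOME%\bin\setenv.bat -- assumes a 64-bit JDK is installed
set "JAVA_HOME=C:\Program Files\Java\jdk1.6.0"
set "CATALINA_OPTS=-Xms2g -Xmx8g"
```

On Linux the same 32-bit vs 64-bit constraint applies, so moving OS alone does not help; the JVM bitness is what matters.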
RE: Solr sentiment analysis
This is a generic machine learning question and is not related to Solr (which this list is for). You can ask this question on Stackoverflow.com. However, one approach: go through the chapter on Non-negative Matrix Factorization in O'Reilly's "Programming Collective Intelligence". That might help you out; it's simple and concise.

-Original Message-
From: maha [mailto:mahab...@gmail.com]
Sent: Friday, December 16, 2011 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr sentiment analysis

Hi, I am doing research in sentiment analysis. Please give your valuable suggestions on how to start my research.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-sentiment-analysis-tp3151415p3590933.html
Sent from the Solr - User mailing list archive at Nabble.com.
SQL Server Solr RAM issue
When I start Solr indexing, the RAM taken by MS SQL Server 2008 R2 also keeps increasing: from some 1 GB initially it went up to 3.6 GB (by the time indexing had completed for just 1 million records/5 GB). I have set the responseBuffering parameter to adaptive in data-config.xml, however it didn't help. Even after killing Tomcat, SQL Server keeps consuming the same memory (i.e. 3.6 GB). Should it not release the memory? And why does SQL Server consume so much RAM during the indexing process in the first place?
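SQL Server's buffer pool grows on demand and, by design, is not released when clients disconnect; it is only trimmed under OS memory pressure or when capped explicitly. If the growth during indexing is a problem, a cap can be set. A sketch; the 4 GB value is an arbitrary example, to be run by a sysadmin:

```sql
-- Cap SQL Server's memory so the buffer pool cannot grow unbounded
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;
```

With the cap in place the DIH full-import can still stream rows; responseBuffering=adaptive controls the JDBC client's buffering, not the server's cache, which is why it did not change this behavior.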
RE: SolR - Index problems
Hi Jiggy

When you query the index, what do you get in the Tomcat logs? (Check that in the tomcat/logs directory.) How much heap memory have you allocated to Tomcat?

- Yavar

From: jiggy [new...@trash-mail.com]
Sent: Wednesday, December 07, 2011 9:53 PM
To: solr-user@lucene.apache.org
Subject: SolR - Index problems

Hello guys,

I have a big problem. I have integrated Solr into Magento EE. I have two Solr folders: one is in c:/tomcat 7.0/ and the other is in my web folder (c:/www/). In the Tomcat folder is the data folder of Solr, with about 200 MB of index files (I think my data from Magento is here). In the www folder are the bin and conf folders of Solr. My problem is that when I try a query in the Solr admin page, I don't get any result. My questions are: is that right with the two folders? And why don't I get any results? I use Windows Server 2008 R2 and Solr 1.4.1. Can anybody help? I read the reference guide and some contributions in this forum, but I didn't get any result. Sorry for my bad English. Thanks in advance.

Best regards,
Jiggy

--
View this message in context: http://lucene.472066.n3.nabble.com/SolR-Index-problems-tp3567883p3567883.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Autocommit & Index Size
Hi Shawn

Absolutely perfect. It is always great reading your answers again and again, as you explain the concepts so very well. Three cheers and thanks for your reply.

Regards,
Yavar

From: Shawn Heisey [s...@elyograg.org]
Sent: Wednesday, December 07, 2011 12:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Autocommit & Index Size

On 12/6/2011 1:01 AM, Husain, Yavar wrote:
> In solrconfig.xml I was experimenting with indexing performance. When I set the maxDocs (in autoCommit) to say 1 documents, the index size is double what it is if I just don't use autoCommit (i.e. keep it commented out and commit only at the end, after adding documents).
>
> Does autoCommit affect the index size, and if so why (as it should only affect performance)?

In other messages you have mentioned using MSSQL as a data source, so I am taking a wild guess and assuming that you are running on Windows. If that's not the case, then I don't know why you are seeing this behavior.

On Windows, the timing involved when index segments are merged results in the old index segments being left on the disk, because they are still open for reading when Solr tries to delete them, and on Windows you cannot delete an open file. Later merge/optimize passes will remove files left over from the previous passes, but will themselves suffer from the same problem.

I am guessing that when you don't have autoCommit turned on, Solr never opens an IndexReader on the segments as they are created, so Windows has no trouble deleting old segments after merging/optimizing. With autoCommit turned on, Solr opens the index segments and makes them searchable after every commit.

Thanks,
Shawn
Autocommit & Index Size
In solrconfig.xml I was experimenting with indexing performance. When I set the maxDocs (in autoCommit) to say 1 documents, the index size is double what it is if I just don't use autoCommit (i.e. keep it commented out and commit only at the end, after adding all documents).

Does autoCommit affect the index size, and if so why (as it should only affect performance)?

- Yavar
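For context, the autoCommit experiment above is controlled by the update handler block in solrconfig.xml, roughly like this (the thresholds shown are illustrative, not the poster's actual values):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>  <!-- ...or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```

Commenting out the whole autoCommit element (as described above) means nothing is committed until the client issues an explicit commit at the end of indexing.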
RE: DIH Strange Problem
Thanks Kai for sharing this. Ian encountered the same problem, so I am copying him on this mail too.

From: Kai Gülzau [kguel...@novomind.com]
Sent: Monday, November 28, 2011 6:55 PM
To: solr-user@lucene.apache.org
Subject: RE: DIH Strange Problem

Do you use Java 6 update 29? There is a known issue with the latest MSSQL driver: http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx

"In addition, there are known connection failure issues with Java 6 update 29, and the developer preview (non production) versions of Java 6 update 30 and Java 6 update 30 build 12. We are in contact with Java on these issues and we will update this blog once we have more information."

Should work with update 28.

Kai

-Original Message-
From: Husain, Yavar [mailto:yhus...@firstam.com]
Sent: Monday, November 28, 2011 1:02 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

I figured out the solution, and Microsoft, not Solr, is the problem here :). I downloaded and built the latest Solr (3.4) from sources and finally hit the following line of code in Solr (where I put my debug statement):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager was not returning. Here was the error!! The driver we were using was the Microsoft Type 4 JDBC driver for SQL Server. I downloaded another driver, the jTDS JDBC driver, and installed that. The problem got fixed!!! So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory where you had put the Microsoft JDBC driver.
3. In data-config.xml use: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml set the url like this: url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.
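Steps 3 and 4 combined, as a data-config.xml dataSource sketch (the database name is left as the placeholder XXX from the steps; user and password here are hypothetical):

```xml
<dataSource type="JdbcDataSource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
            user="solr_user"
            password="..."/>
```

Note the jTDS URL prefix is lowercase "jdbc:jtds:sqlserver://", unlike the Microsoft driver's "jdbc:sqlserver://".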
-Original Message-
From: Husain, Yavar
Sent: Thursday, November 24, 2011 12:38 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

Hi

Thanks for your replies. I carried out these 2 steps (which did not solve my problem):

1. I tried setting responseBuffering to adaptive. Did not work.
2. To check the database connection I wrote a simple Java program to connect to the database and fetch some results with the same driver that I use for Solr. It worked, so it does not seem to be a problem with the connection.

Now I am stuck where the Tomcat log says "Creating a connection for entity ." and does nothing. I mean, after this log we usually get the "getConnection() took x millisecond" message; however I don't get that. I can just see the time moving with no records getting fetched.

Original problem listed again: I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing data. Indexing and all was working perfectly fine. However, today when I started full indexing again, Solr halts/gets stuck at the line "Creating a connection for entity." There are no further messages after that. I can see that DIH is busy; on the DIH console I can see "A command is still running", total rows fetched = 0 and total requests made to the datasource = 1, and the time keeps increasing, however it is not doing anything. This is the exact configuration that worked for me before. I am not really able to understand the problem here. Also, in the index directory where I am storing the index there are just 3 files: 2 segment files + 1 lucene*-write.lock file. ... data-config.xml: . . 
Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=11
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders
RE: DIH Strange Problem
I figured out the solution, and Microsoft, not Solr, is the problem here :). I downloaded and built the latest Solr (3.4) from sources and finally hit the following line of code in Solr (where I put my debug statement):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager was not returning. Here was the error!! The driver we were using was the Microsoft Type 4 JDBC driver for SQL Server. I downloaded another driver, the jTDS JDBC driver, and installed that. The problem got fixed!!! So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory where you had put the Microsoft JDBC driver.
3. In data-config.xml use: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml set the url like this: url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.

-Original Message-
From: Husain, Yavar
Sent: Thursday, November 24, 2011 12:38 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

Hi

Thanks for your replies. I carried out these 2 steps (which did not solve my problem):

1. I tried setting responseBuffering to adaptive. Did not work.
2. To check the database connection I wrote a simple Java program to connect to the database and fetch some results with the same driver that I use for Solr. It worked, so it does not seem to be a problem with the connection.

Now I am stuck where the Tomcat log says "Creating a connection for entity ." and does nothing. I mean, after this log we usually get the "getConnection() took x millisecond" message; however I don't get that. I can just see the time moving with no records getting fetched.

Original problem listed again: I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing data.
Indexing and all was working perfectly fine. However, today when I started full indexing again, Solr halts/gets stuck at the line "Creating a connection for entity." There are no further messages after that. I can see that DIH is busy; on the DIH console I can see "A command is still running", total rows fetched = 0 and total requests made to the datasource = 1, and the time keeps increasing, however it is not doing anything. This is the exact configuration that worked for me before. I am not really able to understand the problem here. Also, in the index directory where I am storing the index there are just 3 files: 2 segment files + 1 lucene*-write.lock file. ... data-config.xml: . .

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=11
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday, November 23, 2011 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Strange Problem

On 11/23/2011 5:21 AM, Chantal Ackermann wrote:
> Hi Yavar,
> > my experience with similar problems was that there was something wrong > with the database connection or the database. > > Chantal It's also possible that your JDBC driver might be trying to buffer the entire result set. There's a link on the wiki specifically for this problem on MS SQL server. Hopefully it's that, but Chantal could be right too. http://wiki.apache.org/solr/DataImportHandlerFaq Here's the URL to the specific paragraph, but it's likely that it won't survive the email trip in a clickable form: http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_MS_SQL_Server_database_with_sqljdbc_driver._DataImportHandler_is_going_o
RE: Unable to index documents using DataImportHandler with MSSQL
Hi Ian

I downloaded and built the latest Solr (3.4) from sources and finally hit the following line of code in Solr (where I put my debug statement):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager was not returning. Here was the error!! The driver we were using was the Microsoft Type 4 JDBC driver for SQL Server. I downloaded another driver, the jTDS JDBC driver, and installed that. The problem got fixed!!! So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory where you had put the Microsoft JDBC driver.
3. In data-config.xml use: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml set the url like this: url="jdbc:jtds:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.

Regards,
Yavar

-Original Message-
From: Ian Grainger [mailto:i...@isfluent.com]
Sent: Monday, November 28, 2011 4:11 PM
To: Husain, Yavar
Cc: solr-user@lucene.apache.org
Subject: Re: Unable to index documents using DataImportHandler with MSSQL

Right. This is REALLY weird - I've now started from scratch on another machine (this time Windows 7), and got _exactly_ the same problem!?

On Mon, Nov 28, 2011 at 7:37 AM, Husain, Yavar wrote:
> Hi Ian
>
> I am having exactly the same problem that you are having, on Win 7 and 2008 Server: http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html
>
> I still have not received any replies which could solve my problem. Please do let me know if you arrive at a solution for your problem.
>
> Thanks.
> > Regards, > Yavar > > -Original Message- > From: Ian Grainger [mailto:i...@isfluent.com] > Sent: Friday, November 25, 2011 10:59 PM > To: solr-user@lucene.apache.org > Subject: Re: Unable to index documents using DataImportHandler with MSSQL > > Update on this: I've established: > * It's not a problem in the DB (I can index from this DB into a Solr > instance on another server) > * It's not Tomcat (I get the same problem in Jetty) > * It's not the schema (I have simplified it to one field) > > That leaves SolrConfig.xml and data-config. > > Only thing changed in SolrConfig.xml is adding: > > regex="apache-solr-cell-\d.*\.jar" /> > regex="apache-solr-clustering-\d.*\.jar" /> > regex="apache-solr-dataimporthandler-\d.*\.jar" /> > class="org.apache.solr.handler.dataimport.DataImportHandler"> > > D:/Software/Solr/example/solr/conf/data-config.xml > > > > And data-config.xml is pretty much as attached - except simpler. > > Any help or any advice on how to diagnose would be appreciated! > > > On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger wrote: >> Hi I have copied my Solr config from a working Windows server to a new >> one, and it can't seem to run an import. >> >> They're both using win server 2008 and SQL 2008R2. This is the data >> importer config >> >> >> > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" >> url="jdbc:sqlserver://localhost;databaseName=DB" >> user="Solr" >> password="pwd"/> >> >> > query="EXEC SOLR_COMPANY_SEARCH_DATA" >> deltaImportQuery="SELECT * FROM Company_Search_Data WHERE >> [key]='${dataimporter.delta.key}'" >> deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt >>> '${dataimporter.last_index_time}'"> >> > name="WorkDesc_Comments_Split" /> >> > /> >> >> >> >> >> I can use MS SQL Profiler to watch the Solr user log in successfully, >> but then nothing. It doesn't seem to even try and execute the stored >> procedure. Any ideas why this would be working one server and not on >> another? 
>> >> FTR the only thing in the tomcat catalina log is: >> >> org.apache.solr.handler.dataimport.JdbcDataSource$1 call >> INFO: Creating a connection for entity data with URL: >> jdbc:sqlserver://localhost;databaseName=CATLive >> >> -- >> Ian >> >> i...@isfluent.com >> +44 (0)1223 257903 >> > > > > -- > Ian > > i...@isfluent.com > +44 (0)1223 257903 >
RE: Unable to index documents using DataImportHandler with MSSQL
Hi Ian

I am having exactly the same problem that you are having, on Win 7 and 2008 Server: http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html

I still have not received any replies which could solve my problem. Please do let me know if you arrive at a solution for your problem.

Thanks.

Regards,
Yavar

-Original Message-
From: Ian Grainger [mailto:i...@isfluent.com]
Sent: Friday, November 25, 2011 10:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Unable to index documents using DataImportHandler with MSSQL

Update on this: I've established:
* It's not a problem in the DB (I can index from this DB into a Solr instance on another server)
* It's not Tomcat (I get the same problem in Jetty)
* It's not the schema (I have simplified it to one field)

That leaves SolrConfig.xml and data-config.

Only thing changed in SolrConfig.xml is adding:

D:/Software/Solr/example/solr/conf/data-config.xml

And data-config.xml is pretty much as attached - except simpler.

Any help or any advice on how to diagnose would be appreciated!

On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger wrote:
> Hi I have copied my Solr config from a working Windows server to a new
> one, and it can't seem to run an import.
>
> They're both using win server 2008 and SQL 2008R2. This is the data
> importer config
>
> driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url="jdbc:sqlserver://localhost;databaseName=DB"
> user="Solr"
> password="pwd"/>
>
> query="EXEC SOLR_COMPANY_SEARCH_DATA"
> deltaImportQuery="SELECT * FROM Company_Search_Data WHERE
> [key]='${dataimporter.delta.key}'"
> deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt
> > '${dataimporter.last_index_time}'">
> name="WorkDesc_Comments_Split" />
> />
>
> I can use MS SQL Profiler to watch the Solr user log in successfully,
> but then nothing. It doesn't seem to even try and execute the stored
> procedure. Any ideas why this would be working on one server and not on
> another?
> FTR the only thing in the tomcat catalina log is:
>
> org.apache.solr.handler.dataimport.JdbcDataSource$1 call
> INFO: Creating a connection for entity data with URL:
> jdbc:sqlserver://localhost;databaseName=CATLive

--
Ian
i...@isfluent.com
+44 (0)1223 257903
RE: DIH Strange Problem
Hi

Thanks for your replies. I carried out these two steps (they did not solve my problem):

1. I tried setting responseBuffering to adaptive. It did not work.
2. To check the database connection, I wrote a simple Java program that connects to the database and fetches some results with the same driver that I use for Solr. It worked, so it does not seem to be a problem with the connection.

Now I am stuck where the Tomcat log says "Creating a connection for entity ..." and does nothing. After this log line we usually get "getConnection() took x milliseconds"; however, I don't get that. I can just see the time moving on with no records getting fetched.

Original problem, listed again:

I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing data. Indexing was working perfectly fine. However, today when I started full indexing again, Solr halts/gets stuck at the line "Creating a connection for entity ..." There are no further messages after that. I can see that DIH is busy, and on the DIH console I can see "A command is still running"; I can also see total rows fetched = 0 and total requests made to datasource = 1, and the time keeps increasing, but it is not doing anything. This is the exact configuration that worked for me before. I am not really able to understand the problem here. Also, in the index directory where I am storing the index there are just 3 files: 2 segment files + 1 lucene*-write.lock file.

data-config.xml:
...
Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=11
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday, November 23, 2011 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Strange Problem

On 11/23/2011 5:21 AM, Chantal Ackermann wrote:
> Hi Yavar,
>
> my experience with similar problems was that there was something wrong
> with the database connection or the database.
>
> Chantal

It's also possible that your JDBC driver might be trying to buffer the entire result set. There's a link on the wiki specifically for this problem on MS SQL Server. Hopefully it's that, but Chantal could be right too.
http://wiki.apache.org/solr/DataImportHandlerFaq

Here's the URL to the specific paragraph, but it's likely that it won't survive the email trip in a clickable form:

http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_MS_SQL_Server_database_with_sqljdbc_driver._DataImportHandler_is_going_out_of_memory._I_tried_adjustng_the_batchSize_values_but_they_don.27t_seem_to_make_any_difference._How_do_I_fix_this.3F

Thanks,
Shawn
DIH Strange Problem
I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing data. Indexing was working perfectly fine. However, today when I started full indexing again, Solr halts/gets stuck at the line "Creating a connection for entity ..." There are no further messages after that. I can see that DIH is busy, and on the DIH console I can see "A command is still running"; I can also see total rows fetched = 0 and total requests made to datasource = 1, and the time keeps increasing, but it is not doing anything. This is the exact configuration that worked for me before. I am not really able to understand the problem here. Also, in the index directory where I am storing the index there are just 3 files: 2 segment files + 1 lucene*-write.lock file.

data-config.xml:
...

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=11
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders
RE: Solr Performance/Architecture
Hi Shawn

That was so great of you to explain the architecture in such detail. I enjoyed reading it multiple times. I have a question here: you mentioned that we can use crc32(DocumentId) % NumServers. I am now using that in my data-config.xml, in the SQL query itself, something like:

For documents to be indexed on server 1:
select DocumentId, PNum, ... from Sample where crc32(DocumentId) % 2 = 0;

For documents to be indexed on server 2:
select DocumentId, PNum, ... from Sample where crc32(DocumentId) % 2 = 1;

Is that the right way to do it? Will it not be a slow query?

Thanks once again.

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Monday, November 21, 2011 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Performance/Architecture

On 11/21/2011 12:41 AM, Husain, Yavar wrote:
> Number of rows in SQL table (indexed till now using Solr): 1 million
> Total size of data in the table: 4 GB
> Total index size: 3.5 GB
>
> Total number of rows that I have to index: 20 million (approximately 100 GB of data) and growing.
>
> What are the best practices with respect to distributing the index? What I mean to say is: when should I distribute, and what is the magic number that I can have for index size per instance?
>
> For 1 million documents, a Solr instance running on a VM takes roughly 2.5 hrs to index for me. So for 20 million it would take roughly 60-70 hrs. That would be too much.
>
> What would be the best distributed architecture for my case? It would be great if people would share their best practices and experience.

I have a MySQL database with 66 million rows at the moment, always growing. My Solr index is split into six large shards and a small shard with the newest data. The small shard (incremental) is calculated by looking at counts of data in hourly increments between 7 and 3.5 days old, and either choosing a boundary that results in less than 500,000 documents or the 3.5 day boundary. This index is usually about 1GB in size.
The rest of the documents are split between the other six shards using crc32(did) % 6. The did field is a MySQL bigint autoincrement field. These large shards are very close to 11 million records and 20GB each. By indexing all six at once, I can complete a full index rebuild in about 3.5 hours.

Each full index chain lives on two 64GB Dell servers with dual quad-core processors. Each server contains a Solr instance with 8GB of heap, running three large shards. One server contains the incremental index; the other server runs the load balancer. Both servers run an index-free Solr core that we call the broker. Its search handlers have the shards parameter in solrconfig.xml, pointed at the appropriate cores for that index chain.

To keep index size down and search speed up, it's important that your index only contain the fields needed for two purposes: searching (indexed fields) and displaying a results grid (stored fields). Any other information should be excluded from your schema.xml and/or DIH config. Full item details should be populated from the database or other information store (possibly a filesystem), using the unique identifier from the search results.

If you are aggregating data from more than one table, see if you can have your database get the information into one SELECT statement with JOINs, rather than having more than one entity in your DIH config. Alternatively, if your secondary tables are small, try using the CachedSqlEntityProcessor on them so they are loaded entirely into RAM for the import. Your database software is usually much better at combining tables than Solr, so take advantage of it.

If you have multivalued search fields from secondary entities in DIH, you can often get your database software to CONCAT them together into a single field, then use an appropriate tokenizer to split them into separate terms.
I have one such field that is semicolon-separated, produced by a database JOIN specified in a view; I then use a pattern tokenizer that splits it at index time.

I hope this is helpful.

Thanks,
Shawn
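The crc32-modulus partitioning discussed in this thread can be sketched as follows. This is a minimal illustration, not the poster's actual setup: the two-server split and the DocumentId field come from the question above, everything else is made up for the example. Note that MySQL's CRC32() operates on the string form of its argument, which the sketch mirrors:

```python
import zlib

NUM_SERVERS = 2  # the two-shard split from the question above


def shard_for(document_id: int) -> int:
    """Assign a document to a shard via crc32(id) % NUM_SERVERS,
    mirroring the WHERE crc32(DocumentId) % 2 = n import queries."""
    # CRC over the decimal string form of the id, as MySQL's CRC32()
    # would compute it for a numeric column.
    return zlib.crc32(str(document_id).encode("ascii")) % NUM_SERVERS


# Each document maps to exactly one shard, so the per-server SELECTs
# partition the table cleanly: together they cover every row, and no
# row is indexed twice.
docs = range(1, 1001)
buckets = {s: [d for d in docs if shard_for(d) == s] for s in range(NUM_SERVERS)}
assert sum(len(b) for b in buckets.values()) == len(docs)
```

One property worth noting: because the modulus is computed per row inside the WHERE clause, the database generally cannot use an index on DocumentId for that predicate, so each per-shard import query ends up scanning the table.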
Solr Performance/Architecture
Number of rows in SQL table (indexed till now using Solr): 1 million
Total size of data in the table: 4 GB
Total index size: 3.5 GB

Total number of rows that I have to index: 20 million (approximately 100 GB of data) and growing.

What are the best practices with respect to distributing the index? What I mean to say is: when should I distribute, and what is the magic number that I can have for index size per instance?

For 1 million documents, a Solr instance running on a VM takes roughly 2.5 hrs to index for me. So for 20 million it would take roughly 60-70 hrs. That would be too much.

What would be the best distributed architecture for my case? It would be great if people would share their best practices and experience.

Thanks!!
write-lock issue
Environment: Solr 1.4 on Windows/MS SQL Server.

A write lock is getting created whenever I try to do a full-import of documents using DIH. The logs say "Creating a connection with the database." and the process does not go forward (it is not getting a database connection), so the indexes are not getting created. Note that no other process is accessing the index, and I even restarted my MS SQL Server service; however, I still see a write.lock file in my index directory. What could be the reason for this? I have even set the unlockOnStartup flag in solrconfig to true, but the indexing is still not happening.
Solr Indexing Time
Solr 1.4 is doing great with respect to indexing on a dedicated physical server (Windows Server 2008): for indexing around 1 million full-text documents (around 4 GB in size) it takes around 20 minutes with heap size = 512M-1G and 4 GB of RAM.

However, while using Solr on a VM with 4 GB of RAM, it took 50 minutes to index the first time. Note that there were no network delays and no RAM issues. Then, when I increased the RAM to 8 GB and increased the heap size, the indexing time increased to 2 hrs. That was really strange. Note that except for SQL Server there is no other process running, and there are no network delays. However, I have not checked file I/O. Can that be a bottleneck? Does Solr have any issues running in a virtualized environment?

I read a paper today by Brian & Harry, "ON THE RESPONSE TIME OF A SOLR SEARCH ENGINE IN A VIRTUALIZED ENVIRONMENT", and they claim that performance deteriorates when RAM is increased while Solr is running on a VM, but that is with respect to query times, not indexing times. I am a bit confused as to why it took longer on the VM when I repeated the same test a second time with increased heap size and RAM.
Solr Indexing Time varying each time I index
Solr 1.4 is doing great with respect to Indexing on a dedicated physical server (Windows Server 2008). For Indexing around 1 million full text documents (around 4 GB size) it takes around 20 minutes with Heap Size = 512M - 1G & 4GB RAM. However while using Solr on a VM, with 4 GB RAM it took 50 minutes to index at the first time. Note that there is no Network delays and no RAM issues. Now when I increased the RAM to 8GB and increased the heap size, the indexing time increased to 2 hrs. That was really strange. Note that except for SQL Server there is no other process running. There are no network delays. However I have not checked for File I/O. Can that be a bottleneck? Does Solr has any issues running in "Virtualization" Environment? I read a paper today by Brian & Harry: "ON THE RESPONSE TIME OF A SOLR SEARCH ENGINE IN A VIRTUALIZED ENVIRONMENT" & they claim that performance gets deteriorated when RAM is increased when Solr is running on a VM but that is with respect to query times and not indexing times. I am bit confused as to why it took longer on a VM when I repeated the same test second time with increased heap size and RAM. ** This message may contain confidential or proprietary information intended only for the use of the addressee(s) named above or may contain information that is legally privileged. If you are not the intended addressee, or the person responsible for delivering it to the intended addressee, you are hereby notified that reading, disseminating, distributing or copying this message is strictly prohibited. If you have received this message by mistake, please immediately notify us by replying to the message and delete the original message and any copies immediately thereafter. Thank you.- ** FAFLD