Boosting documents matching in a specific shard

2012-08-23 Thread Husain, Yavar
I am aware that IDF is not distributed. Suppose I have to boost or give a higher 
rank to documents that match in a specific/particular shard; how can I 
accomplish that?
**
 
This message may contain confidential or proprietary information intended only 
for the use of the 
addressee(s) named above or may contain information that is legally privileged. 
If you are 
not the intended addressee, or the person responsible for delivering it to the 
intended addressee, 
you are hereby notified that reading, disseminating, distributing or copying 
this message is strictly 
prohibited. If you have received this message by mistake, please immediately 
notify us by 
replying to the message and delete the original message and any copies 
immediately thereafter. 

Thank you.- 
**
FAFLD



RE: NGram Indexing Basic Question

2012-07-20 Thread Husain, Yavar
Thanks Erick. Actually it was going in as a phrase query. I set the following 
filter and things are perfect



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, July 20, 2012 5:23 PM
To: solr-user@lucene.apache.org
Subject: Re: NGram Indexing Basic Question

Try attaching &debugQuery=on to your query and look at the parsed query. My 
first guess is that your default operator is AND (or q.op in modern terms) and 
the ngram with "dl" in it is required.

Please paste the results here if that's not the cause.

Best
Erick



NGram Indexing Basic Question

2012-07-19 Thread Husain, Yavar
I have set some of my fields to be NGram indexed, and have also set the 
analyzer at both query and index level.

Most of the stuff works fine except for cases where I simply interchange a 
couple of characters.

For example: "springfield" retrieves correct matches, "springfi" retrieves 
correct matches, "ingfield" retrieves correct matches.

However, when I say "springfiedl" it returns 0 results. I debugged and found 
that at both query and index level I have all the correct N-Grams stored. So 
ideally it should match on "springfie" (which is present in both the query 
N-Grams and the index N-Grams) and return the correct results.

As I was busy, I did not get time to look at the NGram code. What exactly 
happens when I use NGram at query level? Does it split the string into N-Grams 
and then send each of them to the Solr server?

Thanks Sahi for your help yesterday. Appreciate that.
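
For what it's worth, the gram overlap can be sketched in a few lines of Python 
(a sketch only, assuming plain character n-grams of sizes 2-15 on both sides, 
similar to Lucene's NGramTokenizer; not Solr's actual code):

```python
def ngrams(text, min_size=2, max_size=15):
    """All character n-grams of text, like Lucene's NGramTokenizer."""
    return {text[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(text) - n + 1)}

indexed = ngrams("springfield")
query = ngrams("springfiedl")

shared = query & indexed    # many common grams, e.g. 'springfie'
missing = query - indexed   # grams crossing the swap, e.g. 'dl', 'edl'
```

If every query gram is required (default operator AND, or the grams going in as 
a phrase query, which Erick's debugQuery suggestion reveals), the grams in 
`missing` can never match, so the whole query returns 0 results; with OR 
semantics the shared grams would still match.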





RE: NGram for misspelt words

2012-07-18 Thread Husain, Yavar
Thanks Sahi. I have replaced my EdgeNGramFilterFactory with NGramFilterFactory, 
as I need substrings anywhere in the word, not just at the front or back.
You are right; I put the same NGramFilterFactory in both the query and index 
analyzers, but now it does not return any results, not even the basic ones.

-Original Message-
From: Dikchant Sahi [mailto:contacts...@gmail.com] 
Sent: Wednesday, July 18, 2012 7:54 PM
To: solr-user@lucene.apache.org
Subject: Re: NGram for misspelt words

You are creating grams only while indexing, not while querying, hence 'ludlwo' 
does not match. Your analyzer will create the following grams while indexing 
'ludlow': lu lud ludl ludlo ludlow, and hence it will not match 'ludlwo'.

Either you need to create grams while querying as well, or use edit distance.
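
Sahi's point can be sketched in a few lines (a sketch, not Lucene's actual 
implementation; the gram sizes here are assumptions):

```python
def edge_ngrams(text, min_size=2, max_size=15):
    """Front edge n-grams, like EdgeNGramFilterFactory with side='front'."""
    return {text[:n] for n in range(min_size, min(max_size, len(text)) + 1)}

index_grams = edge_ngrams("ludlow")
# {'lu', 'lud', 'ludl', 'ludlo', 'ludlow'}

# Without a query-side gram filter, the raw term 'ludlwo' is looked up as-is
# and matches nothing; with query-side gramming, shared prefixes like 'ludl'
# exist on both sides and the document is found.
raw_term_matches = "ludlwo" in index_grams
shared = edge_ngrams("ludlwo") & index_grams
```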



NGram for misspelt words

2012-07-18 Thread Husain, Yavar



I have configured NGram indexing for some fields.

Say I search for the city Ludlow: I get the results (normal search).

If I search for Ludlo (with the w omitted), I get the results.

If I search for Ludl (with the ow omitted), I still get the results.

I know that these are all partial strings of the main string, hence NGram works 
perfectly.

But when I type in Ludlwo (misspelt, with the characters o and w interchanged), 
I don't get any results. It should ideally match "Ludl" and provide the results.

I am not looking for edit-distance-based spell correctors. How can I make the 
above NGram-based search work?

Here is my schema.xml (NGramFieldType):
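
(The schema snippet itself was stripped from the archive. Purely as an 
illustration of the kind of setup described, not the poster's actual file, an 
index-side-only edge-gram field type looks like the sketch below; the replies 
in this thread resolve the problem by adding a gram filter to the query 
analyzer as well.)

```xml
<!-- Illustration only; names and sizes are assumptions, not the poster's file. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index-side grams only: 'ludlow' -> lu, lud, ludl, ludlo, ludlow -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Without a gram filter here, a misspelt term like 'ludlwo' is searched as-is. -->
  </analyzer>
</fieldType>
```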





RE: Solr On Fly Field creation from full text for N-Gram Indexing

2012-05-10 Thread Husain, Yavar
Thanks Jack.

I tried the Regex Transformer and indexing has become really slow. Is the Regex 
Transformer slower than N-Gram indexing? They may be apples and oranges, but 
what I mean is: after extracting the field, I want to NGram-index it. So it 
seems that NGram indexing the full text (i.e., without extracting what I need 
using the RegexTransformer) is the better solution, ignoring space complexity?

Any views?

THANKS!!

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, May 10, 2012 4:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr On Fly Field creation from full text for N-Gram Indexing

You can use "Regex Transformer" to extract from a source field.

See:
http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

-- Jack Krupansky

 



Solr On Fly Field creation from full text for N-Gram Indexing

2012-05-10 Thread Husain, Yavar
I have full text in my database and I am indexing it using Solr. At runtime, 
i.e., while indexing is going on, can I extract certain parameters based on a 
regex and create another field/column on the fly in Solr for the extracted 
text?

For example, my DB has just 2 columns (DocId & FullText):

DocId  FullText
1      My name is Avi. RoleId: GYUIOP-MN-1087456. ...

Now, while indexing, I want to extract the RoleId, place it in another column 
created on the fly, and index that column using N-Gram indexing. I don't want 
to N-Gram the full text, as that would be too expensive in time.

Thanks!! Any clues would be appreciated.
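
For reference, the extraction itself is a one-liner; in DIH this regex would go 
in the RegexTransformer's regex attribute on the new field. The pattern below 
is only a guess inferred from the single sample value, not a known format:

```python
import re

# Hypothetical RoleId pattern inferred from the one example 'GYUIOP-MN-1087456';
# the real pattern depends on the actual data.
ROLE_ID = re.compile(r"RoleId:\s*([A-Z]+-[A-Z]+-\d+)")

full_text = "My name is Avi. RoleId: GYUIOP-MN-1087456. ..."
match = ROLE_ID.search(full_text)
role_id = match.group(1) if match else None  # 'GYUIOP-MN-1087456'
```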





RE: Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).

2012-04-23 Thread Husain, Yavar
I am sorry, I should have raised this issue on the Tomcat forums. However, I 
was just trying my luck here, as it is indirectly related to Solr.




Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).

2012-04-23 Thread Husain, Yavar

Solr 3.5 was not returning results. To my surprise, Tomcat 6.x (64-bit) was not 
running on my Windows machine. There were absolutely no errors in the logs, no 
crash dumps, nothing. I restarted it and everything seems to be fine now.

I went to the Windows Event Viewer and exported the following information 
related to Tomcat:

Level        Date and Time           Source                   Event ID  Task Category  Message
Information  04/23/2012 8:51:58 AM   Service Control Manager  7036      None           The Apache Tomcat 6 service entered the running state.
Error        04/23/2012 4:17:12 AM   Service Control Manager  7034      None           The Apache Tomcat 6 service terminated unexpectedly. It has done this 2 time(s).
Information  04/16/2012 3:13:15 PM   Service Control Manager  7036      None           The Apache Tomcat 6 service entered the running state.
Error        04/16/2012 1:12:47 PM   Service Control Manager  7034      None           The Apache Tomcat 6 service terminated unexpectedly. It has done this 1 time(s).
Information  04/07/2012 10:02:25 PM  Service Control Manager  7036      None           The Apache Tomcat 6 service entered the running state.

It is a mystery to me, as I don't have any errors in the Tomcat logs. How 
should I go about debugging this problem?

Any help would be appreciated.



RE: Regarding Indexing Multiple Columns Best Practise

2012-03-16 Thread Husain, Yavar
Thanks Erick!!

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, March 16, 2012 6:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Regarding Indexing Multiple Columns Best Practise

I would *guess* you won't notice much/any difference. Note that, if you use a 
fieldType with a positionIncrementGap > 1 (the default is often set to 100), 
phrase queries (with slop) will behave differently depending upon which option 
you choose.

Best
Erick



Indexing Halts for long time and then restarts

2012-03-16 Thread Husain, Yavar
Since Erick is really active answering right now, posting a quick question :)

I am using:
DIH
Solr 3.5 on Windows

Building Auto Recommendation Utility

Having around 1 Billion Query Strings (3-6 words each) in database. Indexing 
them using NGram.

Merge Factor = 30
Auto Commit not set.

DIH halted after indexing 7 million documents for around 25 minutes and was not 
showing any increment in Total Documents Processed/Fetched; of course it was 
doing some work (was it a segment merge?). After 25 minutes it started moving 
again.

Due to this, the indexing time has increased a lot. Any help will be 
appreciated.

Thanks.





Regarding Indexing Multiple Columns Best Practise

2012-03-15 Thread Husain, Yavar
Say I have around 30-40 fields (SQL table columns) indexed in Solr from the 
database. I concatenate those fields into one field using the Solr copyField 
directive and then make that the default search field, which I search.

If, at the database level itself, I perform the concatenation of all those 
fields into one field and then index that field directly (avoiding Solr's copy 
operation from each field to the concatenated field), will it be an indexing 
performance improvement? I am sure it will be, but will it make a big change in 
indexing running time?

Thanks
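
For concreteness, the copyField approach looks something like the sketch below 
in schema.xml (field and type names here are hypothetical, not the poster's 
actual schema). With a multiValued destination, each source field arrives as a 
separate value, which is where Erick's point about the position increment gap 
and phrase slop applies:

```xml
<!-- Illustration only: hypothetical field names; copyField merges each
     column's content into one catch-all field at index time. -->
<field name="city"    type="text" indexed="true" stored="true"/>
<field name="address" type="text" indexed="true" stored="true"/>
<field name="alltext" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="city"    dest="alltext"/>
<copyField source="address" dest="alltext"/>

<defaultSearchField>alltext</defaultSearchField>
```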



RE: Solr out of memory exception

2012-03-15 Thread Husain, Yavar
Thanks a ton.

From: Li Li [fancye...@gmail.com]
Sent: Thursday, March 15, 2012 12:11 PM
To: Husain, Yavar
Cc: solr-user@lucene.apache.org
Subject: Re: Solr out of memory exception

It seems you are using a 64-bit JVM (a 32-bit JVM can only allocate about 
1.5GB). You should enable pointer compression with -XX:+UseCompressedOops.

RE: Solr out of memory exception

2012-03-14 Thread Husain, Yavar
Thanks for helping me out.

I have allocated Xms = 2.0GB, Xmx = 2.0GB.

However, I see Tomcat is still using much less memory, not 2.0GB.

Total memory on my Windows machine = 4GB.

With a smaller index size it works perfectly fine. I was thinking of increasing 
the system RAM and the Tomcat heap allocation, but then how come it works fine 
on a different server with exactly the same system, Solr configuration and 
memory?


-Original Message-
From: Li Li [mailto:fancye...@gmail.com] 
Sent: Thursday, March 15, 2012 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr out of memory exception

How much memory is allocated to the JVM?

On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar  wrote:

> Solr is giving an out of memory exception. Full indexing completed fine.
> Later, while searching, maybe when it tries to load the results into memory,
> it starts giving this exception. Though with the same memory allocated to
> Tomcat and an exactly identical Solr replica on another server, it works
> perfectly fine. I am running 64-bit software, including Java & Tomcat, on
> Windows.
> Any help would be appreciated.
>
> Here are the logs:
>
> The server encountered an internal error (Severe errors in solr
> configuration. Check your log files for more detailed information on what
> may be wrong. If you want solr to continue after configuration errors,
> change: false in
> null -
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
> org.apache.solr.core.SolrCore.(SolrCore.java:579) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
> at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
> at
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
> at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601) at
> org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943) at
> org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778) at
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504) at
> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065) at
> org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057) at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) at
> org.apache.catalina.core.StandardService.start(StandardService.java:525) at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:754) at
> org.apache.catalina.startup.Catalina.start(Catalina.java:595) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
> java.lang.reflect.Method.invoke(Unknown Source) at
> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
> at org.apache.lucene.index.TermInfosReader.(TermInfosReader.java:91)
> at
> org.apache.lucene.index.SegmentReader$CoreReaders.(SegmentReader.java:122)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652) at
> org.apache.lucene.index.SegmentReader.get(SegmentReader.java:613) at
> org.apache.lucene.index.DirectoryReader.(DirectoryReader.java:104) at
> org.apache.lucene.index.ReadOnlyDirectoryReader.(ReadOnlyDirectoryReader.java:27)
> at
> org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69) at
> org.apa

ngram synonyms & dismax together

2012-03-05 Thread Husain, Yavar


I have ngram-indexed 2 fields (columns in the database) and the third one is my 
full-text field. My default search field is the full-text field, and while 
querying I use the dismax handler, specifying both the ngrammed fields with 
certain boost values and the full-text field with a certain boost value.

The problem: if I don't use dismax and just search the full-text field (i.e., 
the default field specified in the schema), synonyms work correctly, i.e., "ca" 
returns all results where "california" occurs. Whereas if I use dismax, "ca" is 
also searched in the ngrammed fields, returns partial matches of the word "ca", 
and never reaches the synonym part at all.

I want synonyms to work in every case; how should I go about it?



Dismax weird behaior wrt defType

2012-03-02 Thread Husain, Yavar
I am seeing weird behavior with respect to "defType". Any clues would be 
appreciated.

Query 1: 
http://localhost:8085/solr/select/?q=abc&version=2.2&start=0&rows=10&indent=on&defType=dismax
 [defType with capital T -- does not fetch results]

Query 2: 
http://localhost:8085/solr/select/?q=abc&version=2.2&start=0&rows=10&indent=on&deftype=dismax
 [defType with small T -- perfect, results returned]

In the above queries I have removed the boosting part, because of which I am 
using dismax. Also, with the boosting stuff in place, it is the other way 
round: deftype returns results with a capital T and does not with a small t.





RE: Spelling Corrector Algorithm

2012-03-01 Thread Husain, Yavar
Thanks Robert. Yes, that's right: I can get more accuracy if I use 
transposition in addition to substitution, insertion and deletion.

From: Robert Muir [rcm...@gmail.com]
Sent: Thursday, March 01, 2012 9:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Spelling Corrector Algorithm

On Thu, Mar 1, 2012 at 6:43 AM, Husain, Yavar  wrote:
> Hi
>
> For spell checking component I set extendedResults to get the frequencies and 
> then select the word with the best frequency. I understand the spell check 
> algorithm based on Edit Distance. For an example:
>
> Query to Solr: Marien
>
> Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. 
> My dictionary here is based on indexed words.
>
> So I chose Market (higher frequency), which is wrong, as my intent was 
> marine. Both have an edit distance of 2.

If you don't mind using trunk, just use directspellchecker, since it
counts marine as an edit distance of 1 from marien (a transposition:
https://issues.apache.org/jira/browse/LUCENE-3662)
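The transposition-aware distance Robert refers to (Damerau-Levenshtein, in its 
optimal-string-alignment form) can be sketched as follows. This is an 
illustration of the metric, not Lucene's actual implementation:

```java
public class EditDistance {
    // Optimal string alignment distance: Levenshtein (insert, delete,
    // substitute) plus transposition of two adjacent characters.
    static int osa(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    // Count an adjacent swap as a single edit.
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(osa("marien", "marine")); // 1: one transposition
        System.out.println(osa("marien", "market")); // 2: two substitutions
    }
}
```

Under this metric "marine" (distance 1) beats "market" (distance 2) for the 
query "marien", which is exactly why the transposition-aware checker resolves 
the tie described above.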

--
lucidimagination.com



RE: Spelling Corrector Algorithm

2012-03-01 Thread Husain, Yavar
Thanks James. I loved the last line in your mail: "But in the end, especially 
with 1-word queries, I doubt even the best algorithms are going to always 
accurately guess what the user wanted." I absolutely agree; if it is a phrase 
(instead of a single word) then we can probably apply some NLP techniques.

From: Dyer, James [james.d...@ingrambook.com]
Sent: Thursday, March 01, 2012 9:29 PM
To: solr-user@lucene.apache.org
Subject: RE: Spelling Corrector Algorithm

Yavar,

When you listed what the spell checker returns you put them in this order:

> Marine (Freq: 120), Market (Freq: 900) and others

Was "Marine" listed first, and then did you pick "Market" because you thought 
higher frequency is better?  If so, you probably have the right settings 
already but need to trust it and go with the first result.

If, on the other hand, the wrong suggestions truly are coming up first, you 
have 2 extension points:

1. You can change the comparator class.  The default one sorts by "score" 
(distance) first and then "frequency" to break ties.  There is also a 
pre-packaged comparator that sorts on frequency alone, or you can write your 
own (implementing Comparator).  But I doubt you'd want to change this one.  
(see 
http://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29
 for more info)
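To make the sorting logic concrete, a frequency-first comparator might look 
like the sketch below. Note that Suggestion here is a minimal stand-in invented 
for the example, not Lucene's real SuggestWord class; a real custom comparator 
would compare SuggestWord's score and freq fields instead:

```java
import java.util.Arrays;
import java.util.Comparator;

public class FreqComparator {
    // Hypothetical stand-in for a spell checker suggestion.
    static class Suggestion {
        final String word; final float score; final int freq;
        Suggestion(String w, float s, int f) { word = w; score = s; freq = f; }
    }

    // Sort by descending frequency, breaking ties on descending score.
    static final Comparator<Suggestion> BY_FREQ =
        Comparator.comparingInt((Suggestion s) -> s.freq).reversed()
                  .thenComparing((Suggestion s) -> s.score, Comparator.reverseOrder());

    public static void main(String[] args) {
        Suggestion[] hits = {
            new Suggestion("marine", 0.8f, 120),
            new Suggestion("market", 0.8f, 900),
        };
        Arrays.sort(hits, BY_FREQ);
        System.out.println(hits[0].word); // market: higher frequency wins
    }
}
```

As the thread notes, sorting purely by frequency is usually the wrong choice; 
this only illustrates the extension point.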

2. You can change the distance metric.  The default uses Levenshtein distance, 
but there is also an implementation for Jaro-Winkler distance.  (see the 
wikipedia articles for these 2 if you want to know the subtle differences).  It 
almost seems to me that Jaro-Winkler might give you better results but you'd 
have to test.  See the example under 
http://wiki.apache.org/solr/SpellCheckComponent?highlight=%28distanceMeasure%29#Configuration
 for more information on how to configure this.

If neither distance measure works for you, you could try implementing your own 
by creating a class implementing 
"org.apache.lucene.search.spell.StringDistance", then specify your class for 
the "distanceMeasure" parameter.

Finally, there are some other possibly easy solutions to your problem you 
should test before going through the trouble of writing custom code:

1. Try a higher "spellcheck.count".  Even if you only want a couple of results, 
the algorithm works better with this set >5 (10-20 might be optimal in some 
cases).
2. Use DirectSolrSpellChecker, if on 4.x.  This one is not influenced by 
"spellcheck.count", so you can truly set it to 1 if all you want is 1 result.
3. Use "spellcheck.collate=true" and set "spellcheck.maxCollationTries" to 
maybe 5 or 10.  This will try the various suggestions by querying the index 
along with any other query parameters (other keywords, filters, etc), letting 
you know which suggestions are going to truly return hits in context (and how 
many).
4. Try Jaro-Winkler (as mentioned above).
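Put together, a spellcheck request combining suggestions 1 and 3 might look 
like this (host, port, and handler path are placeholders; the spellcheck 
parameters are the ones named above):

```
http://localhost:8983/solr/select?q=marien&spellcheck=true&spellcheck.count=10&spellcheck.extendedResults=true&spellcheck.collate=true&spellcheck.maxCollationTries=5
```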

Hope this helps.  But in the end, especially with 1-word queries, I doubt even 
the best algorithms are going to always accurately guess what the user wanted.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Husain, Yavar [mailto:yhus...@firstam.com]
Sent: Thursday, March 01, 2012 5:43 AM
To: solr-user@lucene.apache.org
Subject: Spelling Corrector Algorithm

Hi

For the spell checking component I set extendedResults to get the frequencies 
and then select the word with the best frequency. I understand the spell check 
algorithm is based on edit distance. For example:

Query to Solr: Marien

Spell check text returned: Marine (freq: 120), Market (freq: 900) and others. 
My dictionary here is based on indexed words.

So I chose Market (higher frequency), which is wrong, as my intent was marine. 
Both have an edit distance of 2.

Now how can I improve this algorithm to select marine instead of market (based 
on something more than edit distance and frequency)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for something simple that I can implement quickly.

I even tried using Peter Norvig's spell corrector algorithm (which is great) 
but ran into the same problems.



Spelling Corrector Algorithm

2012-03-01 Thread Husain, Yavar
Hi

For the spell checking component I set extendedResults to get the frequencies 
and then select the word with the best frequency. I understand the spell check 
algorithm is based on edit distance. For example:

Query to Solr: Marien

Spell check text returned: Marine (freq: 120), Market (freq: 900) and others. 
My dictionary here is based on indexed words.

So I chose Market (higher frequency), which is wrong, as my intent was marine. 
Both have an edit distance of 2.

Now how can I improve this algorithm to select marine instead of market (based 
on something more than edit distance and frequency)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for something simple that I can implement quickly.

I even tried using Peter Norvig's spell corrector algorithm (which is great) 
but ran into the same problems.





Solr Basic Performance Test with duplicated data

2012-02-10 Thread Husain, Yavar
Will testing Solr with duplicated data in the database produce the same 
performance statistics as testing Solr with completely unique data? By test I 
mean routine performance tests like time to index, time to search, etc. Will 
Solr perform any kind of optimization that results in different statistics for 
duplicated versus unique data?





Solr Indexing Running Time 32bit vs 64bit

2012-01-23 Thread Husain, Yavar
I was running 32-bit Java (JDK, JRE & Tomcat) on my 64-bit Windows. For 
indexing I was not able to allocate more than 1.5GB of heap space on my 
machine. Each time, my Tomcat process would hit that upper bound (1.5GB) very 
quickly, so I moved to 64-bit Java/Tomcat. Now I don't see any substantial 
change in indexing running time (though I have increased the heap size); it 
still takes the same time it took on the 32-bit software. The only difference 
is that the memory consumed by Tomcat is now extremely low, i.e. it no longer 
touches the 1.5GB mark. What may be the reason for this?





Solr Tomcat Maximum Heap Memory

2011-12-21 Thread Husain, Yavar
I know this is a Solr forum however my problem is related to Solr running on 
Tomcat running on Windows 64 bit OS.

I am running a 32-bit JVM on a 64-bit Windows 2008 Server. The max heap space I 
am able to allocate is around 1.5 GB, though I have 10 GB of RAM on my system 
and there is no other process running.
I understand the limit of about 2GB of heap space that can be allocated to a 
process on Windows. However, I have seen people in the forums state they use 
Xmx up to 10G. How is this possible? If I move to Linux, can I get more heap 
space allocated to the process, or is it related to the JVM?

Simply put, how can I allocate at least 8GB of RAM as Xmx to Tomcat on my 
64-bit Windows? Tomcat crashes when I start it with that setting. Please help.





RE: Solr sentiment analysis

2011-12-15 Thread Husain, Yavar
This is a generic machine learning question and is not related to Solr (which 
this list is for). You could ask it on stackoverflow.com.
However, one approach: go through the chapter on Non-Negative Matrix 
Factorization in O'Reilly's Programming Collective Intelligence. That might 
help you out; it is simple and concise.

-Original Message-
From: maha [mailto:mahab...@gmail.com] 
Sent: Friday, December 16, 2011 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr sentiment analysis

Hi, I am doing research in sentiment analysis. Please give your valuable
suggestions on how to start my research.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-sentiment-analysis-tp3151415p3590933.html
Sent from the Solr - User mailing list archive at Nabble.com.



SQL Server Solr RAM issue

2011-12-08 Thread Husain, Yavar
When I start Solr indexing, the RAM taken by MS SQL Server 2008 R2 also keeps 
increasing: from about 1GB initially it grew to 3.6GB (by the time indexing 
completed for just 1 million records/5GB). I have set the responseBuffering 
parameter to adaptive in data-config.xml, but it didn't help. Even after 
killing Tomcat, SQL Server keeps consuming the same memory (3.6GB). Shouldn't 
it release the memory? And why does SQL Server consume so much RAM during the 
indexing process?









RE: SolR - Index problems

2011-12-07 Thread Husain, Yavar

Hi Jiggy

When you query the index, what do you get in the Tomcat logs? (Check them in 
the tomcat/logs directory.)

How much heap memory have you allocated to Tomcat?

- Yavar


From: jiggy [new...@trash-mail.com]
Sent: Wednesday, December 07, 2011 9:53 PM
To: solr-user@lucene.apache.org
Subject: SolR - Index problems

Hello Guys,

I have a big problem. I have integrated Solr into Magento EE. I have two Solr
folders: one in c:/tomcat 7.0/
and the other in my web folder (c:/www/).

The Tomcat folder contains Solr's data folder, with about 200 MB of index
files (I think this is my data from Magento).
The www folder contains Solr's bin and conf folders.

My problem is that when I try a query in the Solr admin page, I don't get
any result.

My questions are:
Is the two-folder setup right?
And why don't I get any results?

I use Windows Server 2008 R2 and Solr 1.4.1.

Can anybody help? I read the reference guide and some posts in this
forum, but didn't find an answer.

Sorry for my bad English.

Thanks in advance.

Best regards,

Jiggy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolR-Index-problems-tp3567883p3567883.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Autocommit & Index Size

2011-12-06 Thread Husain, Yavar
Hi Shawn

Absolutely perfect. It is always great reading your answers again and again as 
you explain the concepts so very well. Three cheers and thanks for your reply.

Regards,
Yavar

From: Shawn Heisey [s...@elyograg.org]
Sent: Wednesday, December 07, 2011 12:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Autocommit & Index Size

On 12/6/2011 1:01 AM, Husain, Yavar wrote:
> In solrconfig.xml I was experimenting with indexing performance. When I set 
> maxDocs (in autoCommit) to say 1 document, the index size is double what it 
> is if I just don't use autoCommit (i.e. keep it commented out and commit only 
> at the end, after adding documents).
>
> Does autoCommit affect the index size, and if so why (as it should only 
> affect performance)?

In other messages you have mentioned using MSSQL as a data source, so I
am taking a wild guess and assuming that you are running on Windows.  If
that's not the case, then I don't know why you are seeing this behavior.

On Windows, the timing involved when index segments are merged results
in the old index segments being left on the disk, because they are still
open for reading when Solr tries to delete them, and on Windows you
cannot delete an open file.  Later merge/optimize passes will remove
files left over from the previous passes, but will themselves suffer
from the same problem.

I am guessing that when you don't have autoCommit turned on, Solr never
opens an IndexReader on the segments as they are created, so Windows has
no trouble deleting old segments after merging/optimizing.  With
autoCommit turned on, Solr opens the index segments and makes them
searchable after every commit.

Thanks,
Shawn
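For reference, the autoCommit setting under discussion lives in solrconfig.xml, 
roughly as sketched below; the threshold values are illustrative only, not 
taken from the thread:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commit automatically once either threshold is reached;
       comment the block out to commit only when the client asks. -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```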




Autocommit & Index Size

2011-12-06 Thread Husain, Yavar
In solrconfig.xml I was experimenting with indexing performance. When I set 
maxDocs (in autoCommit) to say 1 document, the index size is double what it is 
if I just don't use autoCommit (i.e. keep it commented out and commit only at 
the end, after adding documents).

Does autoCommit affect the index size, and if so why (as it should only affect 
performance)?

- Yavar






RE: DIH Strange Problem

2011-11-28 Thread Husain, Yavar

Thanks Kai for sharing this. Ian encountered the same problem so marking him in 
the mail too.

From: Kai Gülzau [kguel...@novomind.com]
Sent: Monday, November 28, 2011 6:55 PM
To: solr-user@lucene.apache.org
Subject: RE: DIH Strange Problem

Do you use Java 6 update 29? There is a known issue with the latest mssql 
driver:

http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx

"In addition, there are known connection failure issues with Java 6 update 29, 
and the developer preview (non production) versions of Java 6 update 30 and 
Java 6 update 30 build 12.  We are in contact with Java on these issues and we 
will update this blog once we have more information."

Should work with update 28.

Kai

-Original Message-----
From: Husain, Yavar [mailto:yhus...@firstam.com]
Sent: Monday, November 28, 2011 1:02 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

I figured out the solution, and Microsoft, not Solr, is the problem here :):

I downloaded and built the latest Solr (3.4) from source and finally hit the 
following lines of code in Solr (where I put my debug statements):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager.getConnection() was not returning. Here was the 
error!! The driver we were using was the Microsoft Type 4 JDBC driver for SQL 
Server. I downloaded another driver, the jTDS JDBC driver, and installed that. 
Problem fixed!

So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory, where you had put the 
Microsoft JDBC driver.
3. In data-config.xml use: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml, give the url like this: 
url="jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.
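A complete dataSource element for these steps might look like the sketch below. 
The server, database, and credentials are placeholders; the URL here uses the 
slash-separated form documented on the jTDS site (jdbc:jtds:sqlserver://host:port/db):

```xml
<!-- Sketch only: all names and credentials are placeholders. -->
<dataSource type="JdbcDataSource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://localhost:1433/SampleOrders"
            user="solr"
            password="secret"/>
```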

-Original Message-
From: Husain, Yavar
Sent: Thursday, November 24, 2011 12:38 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

Hi

Thanks for your replies.

I carried out these 2 steps (they did not solve my problem):

1. I tried setting responseBuffering to adaptive. It did not work.
2. To check the database connection, I wrote a simple Java program to connect 
to the database and fetch some results with the same driver I use for Solr. It 
worked, so the connection does not seem to be the problem.

Now I am stuck where the Tomcat log says "Creating a connection for entity ..." 
and then does nothing. After this line we usually get "getConnection() took x 
milliseconds", but I don't get that; I can just see the time increasing with no 
records getting fetched.

Original Problem listed again:


I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
data. Indexing was working perfectly fine. However, today when I started full 
indexing again, Solr halts/gets stuck at the line "Creating a connection for 
entity ..." There are no further messages after that. I can see that DIH is 
busy, and on the DIH console I can see "A command is still running"; I can also 
see total rows fetched = 0 and total requests made to the datasource = 1, and 
the time keeps increasing while nothing happens. This is the exact 
configuration that worked for me before, so I am not really able to understand 
the problem. Also, in the index directory where I am storing the index there 
are just 3 files: 2 segment files + 1 lucene*-write.lock file.
...
data-config.xml:

  .
.

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
QTime=11 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX Nov 23, 2011 4:11:27 PM 
org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: 
jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders


-Original Message--

RE: DIH Strange Problem

2011-11-28 Thread Husain, Yavar
I figured out the solution, and Microsoft, not Solr, is the problem here :):

I downloaded and built the latest Solr (3.4) from source and finally hit the 
following lines of code in Solr (where I put my debug statements):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager.getConnection() was not returning. Here was the 
error!! The driver we were using was the Microsoft Type 4 JDBC driver for SQL 
Server. I downloaded another driver, the jTDS JDBC driver, and installed that. 
Problem fixed!

So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory, where you had put the 
Microsoft JDBC driver.
3. In data-config.xml use: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml, give the url like this: 
url="jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.

-Original Message-
From: Husain, Yavar 
Sent: Thursday, November 24, 2011 12:38 PM
To: solr-user@lucene.apache.org; Shawn Heisey
Subject: RE: DIH Strange Problem

Hi

Thanks for your replies.

I carried out these 2 steps (they did not solve my problem):

1. I tried setting responseBuffering to adaptive. It did not work.
2. To check the database connection, I wrote a simple Java program to connect 
to the database and fetch some results with the same driver I use for Solr. It 
worked, so the connection does not seem to be the problem.

Now I am stuck where the Tomcat log says "Creating a connection for entity ..." 
and then does nothing. After this line we usually get "getConnection() took x 
milliseconds", but I don't get that; I can just see the time increasing with no 
records getting fetched.

Original Problem listed again:


I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
data. Indexing was working perfectly fine. However, today when I started full 
indexing again, Solr halts/gets stuck at the line "Creating a connection for 
entity ..." There are no further messages after that. I can see that DIH is 
busy, and on the DIH console I can see "A command is still running"; I can also 
see total rows fetched = 0 and total requests made to the datasource = 1, and 
the time keeps increasing while nothing happens. This is the exact 
configuration that worked for me before, so I am not really able to understand 
the problem. Also, in the index directory where I am storing the index there 
are just 3 files: 2 segment files + 1 lucene*-write.lock file.
...
data-config.xml:



.
.

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
QTime=11 
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: 
jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, November 23, 2011 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Strange Problem

On 11/23/2011 5:21 AM, Chantal Ackermann wrote:
> Hi Yavar,
>
> my experience with similar problems was that there was something wrong
> with the database connection or the database.
>
> Chantal

It's also possible that your JDBC driver might be trying to buffer the 
entire result set.  There's a link on the wiki specifically for this 
problem on MS SQL server.  Hopefully it's that, but Chantal could be 
right too.

http://wiki.apache.org/solr/DataImportHandlerFaq

Here's the URL to the specific paragraph, but it's likely that it won't 
survive the email trip in a clickable form:

http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_MS_SQL_Server_database_with_sqljdbc_driver._DataImportHandler_is_going_o

RE: Unable to index documents using DataImportHandler with MSSQL

2011-11-28 Thread Husain, Yavar
Hi Ian

I downloaded and built the latest Solr (3.4) from source and finally hit the 
following lines of code in Solr (where I put my debug statements):

if (url != null) {
    LOG.info("Yavar: getting handle to driver manager:");
    c = DriverManager.getConnection(url, initProps);
    LOG.info("Yavar: got handle to driver manager:");
}

The call to DriverManager.getConnection() was not returning. Here was the 
error!! The driver we were using was the Microsoft Type 4 JDBC driver for SQL 
Server. I downloaded another driver, the jTDS JDBC driver, and installed that. 
Problem fixed!

So please follow these steps:

1. Download the jTDS JDBC driver from http://jtds.sourceforge.net/
2. Put the driver jar file into your Solr/lib directory, where you had put the 
Microsoft JDBC driver.
3. In data-config.xml use: driver="net.sourceforge.jtds.jdbc.Driver"
4. Also in data-config.xml, give the url like this: 
url="jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX"
5. Now run your indexing.

It should solve the problem.

Regards,
Yavar

-Original Message-
From: Ian Grainger [mailto:i...@isfluent.com] 
Sent: Monday, November 28, 2011 4:11 PM
To: Husain, Yavar
Cc: solr-user@lucene.apache.org
Subject: Re: Unable to index documents using DataImportHandler with MSSQL

Right.
This is REALLY weird - I've now started from scratch on another
machine (this time Windows 7), and got _exactly_ the same problem !?


On Mon, Nov 28, 2011 at 7:37 AM, Husain, Yavar  wrote:
> Hi Ian
>
> I am having exactly the same problem that you are having, on Win 7 and 2008 
> Server: http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html
>
> I still have not received any replies which could solve my problem till now. 
> Please do let me know if you have arrived at some solution for your problem.
>
> Thanks.
>
> Regards,
> Yavar
>
> -Original Message-
> From: Ian Grainger [mailto:i...@isfluent.com]
> Sent: Friday, November 25, 2011 10:59 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Unable to index documents using DataImportHandler with MSSQL
>
> Update on this: I've established:
> * It's not a problem in the DB (I can index from this DB into a Solr
> instance on another server)
> * It's not Tomcat (I get the same problem in Jetty)
> * It's not the schema (I have simplified it to one field)
>
> That leaves SolrConfig.xml and data-config.
>
> Only thing changed in SolrConfig.xml is adding:
>
>   <lib dir="..." regex="apache-solr-cell-\d.*\.jar" />
>   <lib dir="..." regex="apache-solr-clustering-\d.*\.jar" />
>   <lib dir="..." regex="apache-solr-dataimporthandler-\d.*\.jar" />
>   <requestHandler name="..."
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
>     </lst>
>   </requestHandler>
>
> And data-config.xml is pretty much as attached - except simpler.
>
> Any help or any advice on how to diagnose would be appreciated!
>
>
> On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger  wrote:
>> Hi I have copied my Solr config from a working Windows server to a new
>> one, and it can't seem to run an import.
>>
>> They're both using win server 2008 and SQL 2008R2. This is the data
>> importer config
>>
>>    <dataConfig>
>>      <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>>            url="jdbc:sqlserver://localhost;databaseName=DB"
>>            user="Solr"
>>            password="pwd"/>
>>      <document>
>>        <entity name="data" query="EXEC SOLR_COMPANY_SEARCH_DATA"
>>          deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
>>          deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
>>          <field column="..." name="WorkDesc_Comments_Split" />
>>          <field column="..." />
>>        </entity>
>>      </document>
>>    </dataConfig>
>>
>> I can use MS SQL Profiler to watch the Solr user log in successfully,
>> but then nothing. It doesn't seem to even try to execute the stored
>> procedure. Any ideas why this would be working on one server and not on
>> another?
>>
>> FTR the only thing in the tomcat catalina log is:
>>
>>    org.apache.solr.handler.dataimport.JdbcDataSource$1 call
>>    INFO: Creating a connection for entity data with URL:
>> jdbc:sqlserver://localhost;databaseName=CATLive
>>
>> --
>> Ian
>>
>> i...@isfluent.com
>> +44 (0)1223 257903
>>
>
>
>
> --
> Ian
>
> i...@isfluent.com
> +44 (0)1223 257903
>

RE: Unable to index documents using DataImportHandler with MSSQL

2011-11-27 Thread Husain, Yavar
Hi Ian

I am having exactly the same problem that you are having, on Win 7 and 2008 
Server: http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html

I still have not received any replies that solve my problem. Please do let me 
know if you arrive at a solution for yours.

Thanks.

Regards,
Yavar

-Original Message-
From: Ian Grainger [mailto:i...@isfluent.com] 
Sent: Friday, November 25, 2011 10:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Unable to index documents using DataImportHandler with MSSQL

Update on this: I've established:
* It's not a problem in the DB (I can index from this DB into a Solr
instance on another server)
* It's not Tomcat (I get the same problem in Jetty)
* It's not the schema (I have simplified it to one field)

That leaves SolrConfig.xml and data-config.

Only thing changed in SolrConfig.xml is adding:

  <lib dir="..." regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="..." regex="apache-solr-clustering-\d.*\.jar" />
  <lib dir="..." regex="apache-solr-dataimporthandler-\d.*\.jar" />
  <requestHandler name="..."
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
    </lst>
  </requestHandler>

And data-config.xml is pretty much as attached - except simpler.

Any help or any advice on how to diagnose would be appreciated!


On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger  wrote:
> Hi I have copied my Solr config from a working Windows server to a new
> one, and it can't seem to run an import.
>
> They're both using win server 2008 and SQL 2008R2. This is the data
> importer config
>
>    <dataConfig>
>      <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>            url="jdbc:sqlserver://localhost;databaseName=DB"
>            user="Solr"
>            password="pwd"/>
>      <document>
>        <entity name="data" query="EXEC SOLR_COMPANY_SEARCH_DATA"
>          deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
>          deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
>          <field column="..." name="WorkDesc_Comments_Split" />
>          <field column="..." />
>        </entity>
>      </document>
>    </dataConfig>
>
> I can use MS SQL Profiler to watch the Solr user log in successfully,
> but then nothing. It doesn't seem to even try to execute the stored
> procedure. Any ideas why this would be working on one server and not on
> another?
>
> FTR the only thing in the tomcat catalina log is:
>
>    org.apache.solr.handler.dataimport.JdbcDataSource$1 call
>    INFO: Creating a connection for entity data with URL:
> jdbc:sqlserver://localhost;databaseName=CATLive
>
> --
> Ian
>
> i...@isfluent.com
> +44 (0)1223 257903
>



-- 
Ian

i...@isfluent.com
+44 (0)1223 257903



RE: DIH Strange Problem

2011-11-23 Thread Husain, Yavar
Hi

Thanks for your replies.

I carried out these 2 steps (it did not solve my problem):

1. I tried setting responseBuffering to adaptive. Did not work.
2. To check the database connection, I wrote a simple Java program that connects 
to the database and fetches some results using the same driver I use for Solr. 
It worked, so the problem does not seem to be with the connection.

Now I am stuck at the point where the Tomcat log says "Creating a connection 
for entity ." and then does nothing. After this line we usually get 
"getConnection() took x milliseconds", but I don't get that; I can just see 
the time increasing with no records getting fetched.

Original Problem listed again:


I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
data. Indexing was working perfectly fine. However, today when I started a 
full index again, Solr halts/gets stuck at the line "Creating a connection for 
entity." There are no further messages after that. I can see that DIH is busy, 
and on the DIH console I can see "A command is still running"; I can also see 
total rows fetched = 0 and total requests made to the datasource = 1, and the 
time keeps increasing while nothing gets fetched. This is the exact 
configuration that worked for me before. I am not really able to understand 
the problem here. Also, in the directory where I am storing the index there 
are just 3 files: 2 segment files + 1 lucene*-write.lock file.
...
data-config.xml:



.
.

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
QTime=11 
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: 
jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, November 23, 2011 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Strange Problem

On 11/23/2011 5:21 AM, Chantal Ackermann wrote:
> Hi Yavar,
>
> my experience with similar problems was that there was something wrong
> with the database connection or the database.
>
> Chantal

It's also possible that your JDBC driver might be trying to buffer the 
entire result set.  There's a link on the wiki specifically for this 
problem on MS SQL server.  Hopefully it's that, but Chantal could be 
right too.

http://wiki.apache.org/solr/DataImportHandlerFaq

Here's the URL to the specific paragraph, but it's likely that it won't 
survive the email trip in a clickable form:

http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_MS_SQL_Server_database_with_sqljdbc_driver._DataImportHandler_is_going_out_of_memory._I_tried_adjustng_the_batchSize_values_but_they_don.27t_seem_to_make_any_difference._How_do_I_fix_this.3F

Thanks,
Shawn
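The workaround that FAQ entry describes for the sqljdbc driver is adding responseBuffering=adaptive to the connection URL so the driver streams rows instead of buffering the whole result set. A small sketch (the helper class is mine; the base URL is taken from the logs earlier in this thread):

```java
public class SqlServerUrl {
    // Appends the adaptive response-buffering property to a sqljdbc URL;
    // the resulting string goes in the url attribute of data-config.xml's dataSource.
    static String withAdaptiveBuffering(String baseUrl) {
        return baseUrl + ";responseBuffering=adaptive";
    }

    public static void main(String[] args) {
        System.out.println(withAdaptiveBuffering(
            "jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders"));
    }
}
```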



DIH Strange Problem

2011-11-23 Thread Husain, Yavar

I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
data. Indexing was working perfectly fine. However, today when I started a 
full index again, Solr halts/gets stuck at the line "Creating a connection for 
entity." There are no further messages after that. I can see that DIH is busy, 
and on the DIH console I can see "A command is still running"; I can also see 
total rows fetched = 0 and total requests made to the datasource = 1, and the 
time keeps increasing while nothing gets fetched. This is the exact 
configuration that worked for me before. I am not really able to understand 
the problem here. Also, in the directory where I am storing the index there 
are just 3 files: 2 segment files + 1 lucene*-write.lock file.
...
data-config.xml:



.
.

Logs:

INFO: Server startup in 2016 ms
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
QTime=11
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
   
commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1322041133719
Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity SampleText with URL: 
jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders





RE: Solr Performance/Architecture

2011-11-22 Thread Husain, Yavar
Hi Shawn

That was so great of you to explain the architecture in such detail. I enjoyed 
reading it multiple times.

I have a question here:

You mentioned that we can use crc32(DocumentId) % NumServers. I am actually 
using that in my data-config.xml, in the SQL query itself, something like:

For Documents to be indexed on Server 1: select DocumentId,PNum,... from Sample 
where crc32(DocumentId)%2=0;
For Documents to be indexed on Server 2: select DocumentId,PNum,... from Sample 
where crc32(DocumentId)%2=1;

Is that the right way to do it? Won't it be a slow query?

Thanks once again.



-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, November 21, 2011 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Performance/Architecture

On 11/21/2011 12:41 AM, Husain, Yavar wrote:
> Number of rows in SQL Table (Indexed till now using Solr): 1 million
> Total Size of Data in the table: 4GB
> Total Index Size: 3.5 GB
>
> Total Number of Rows that I have to index: 20 Million (approximately 100 GB 
> Data) and growing
>
> What are the best practices with respect to distributing the index? What I 
> mean is: when should I distribute, and is there a magic number for index 
> size per instance?
>
> For 1 million rows alone, a Solr instance running on a VM takes roughly 2.5 
> hours to index. So 20 million would take roughly 60-70 hours. That would be 
> too much.
>
> What would be the best distributed architecture for my case? It will be great 
> if people may share their best practices and experience.

I have a MySQL database with 66 million rows at the moment, always 
growing.  My Solr index is split into six large shards and a small shard 
with the newest data.  The small shard (incremental) is calculated by 
looking at counts of data in hourly increments between 7 and 3.5 days 
old, and either choosing a boundary that results in less than 500,000 
documents or the 3.5 day boundary.  This index is usually about 1GB in size.

The rest of the documents are split between the other six shards using 
crc32(did) % 6.  The did field is a mysql bigint autoincrement field.  
These large shards are very close to 11 million records and 20GB each.  
By indexing all six at once, I can complete a full index rebuild in 
about 3.5 hours.
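For illustration, the crc32(did) % 6 split above can be computed outside the database with java.util.zip.CRC32. This sketch is mine, not code from Shawn's system, and it treats the document id as a string:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ShardAssigner {
    // Returns the shard index in [0, numShards) for a document id,
    // mirroring the crc32(did) % numShards split described above.
    static int shardFor(String docId, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"100", "101", "102"}) {
            System.out.println(id + " -> shard " + shardFor(id, 6));
        }
    }
}
```

Because CRC32 is deterministic, the same id always lands on the same shard, which is what makes per-shard delta imports safe.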

Each full index chain lives on two 64GB Dell servers with dual quad-core 
processors.  Each server contains a Solr instance with 8GB of heap, 
running three large shards.  One server contains the incremental index, 
the other server runs the load balancer.  Both servers run an index-free 
Solr core that we call the broker.  Its search handlers have the shards 
parameter in solrconfig.xml, pointed at the appropriate cores for that 
index chain.

To keep index size down and search speed up, it's important that your 
index only contain the fields needed for two purposes: Searching 
(indexed fields) and displaying a results grid (stored fields).  Any 
other information should be excluded from your schema.xml and/or DIH 
config.  Full item details should be populated from the database or 
other information store (possibly a filesystem), using the unique 
identifier from the search results.

If you are aggregating data from more than one table, see if you can 
have your database get the information into one SELECT statement with 
JOINs, rather than having more than one entity in your DIH config.  
Alternatively, if your secondary tables are small, try using the 
CachedSQLEntityProcessor on them so they are loaded entirely into RAM 
for the import.  Your database software is usually much better at 
combining tables than Solr, so take advantage of it.

If you have multivalued search fields from secondary entities in DIH, 
you can often get your database software to CONCAT them together into a 
single field, then use an appropriate tokenizer to split them into 
separate terms.  I have one such field that is semicolon separated by a 
database JOIN that's specified in a view, then I use a pattern tokenizer 
that splits it at index time.

I hope this is helpful.

Thanks,
Shawn




Solr Performance/Architecture

2011-11-20 Thread Husain, Yavar

Number of rows in SQL Table (Indexed till now using Solr): 1 million
Total Size of Data in the table: 4GB
Total Index Size: 3.5 GB

Total Number of Rows that I have to index: 20 Million (approximately 100 GB 
Data) and growing

What are the best practices with respect to distributing the index? What I mean 
is: when should I distribute, and is there a magic number for index size per 
instance?

For 1 million rows alone, a Solr instance running on a VM takes roughly 2.5 
hours to index. So 20 million would take roughly 60-70 hours. That would be too 
much.

What would be the best distributed architecture for my case? It will be great 
if people may share their best practices and experience.

Thanks!!





write-lock issue

2011-11-18 Thread Husain, Yavar
Environment: Solr 1.4 on Windows/MS SQL Server

A write lock is created whenever I try to do a full import of documents using 
DIH. The logs say "Creating a connection with the database." and the process 
does not go forward (it never gets a database connection), so the index is not 
getting created. Note that no other process is accessing the index, and I even 
restarted my MS SQL Server service. However, I still see a write.lock file in 
my index directory.

What could be the reason for this? Even though I have set the unlockOnStartup 
flag to true in solrconfig, the indexing is still not happening.







Solr Indexing Time

2011-11-10 Thread Husain, Yavar
Solr 1.4 is doing great with respect to indexing on a dedicated physical server 
(Windows Server 2008). Indexing around 1 million full-text documents (around 
4 GB in size) takes around 20 minutes with heap size = 512M - 1G and 4 GB RAM.

However, using Solr on a VM with 4 GB RAM, it took 50 minutes to index the 
first time. Note that there are no network delays and no RAM issues. When I 
increased the RAM to 8 GB and increased the heap size, the indexing time 
increased to 2 hours. That was really strange. Note that except for SQL Server 
there is no other process running. However, I have not checked file I/O; could 
that be a bottleneck? Does Solr have any issues running in a virtualized 
environment?

I read a paper today by Brian & Harry, "ON THE RESPONSE TIME OF A SOLR SEARCH 
ENGINE IN A VIRTUALIZED ENVIRONMENT", and they claim that performance 
deteriorates when RAM is increased while Solr runs on a VM, but that is with 
respect to query times, not indexing times.

I am a bit confused as to why it took longer on the VM when I repeated the 
same test a second time with increased heap size and RAM.






Solr Indexing Time varying each time I index

2011-11-10 Thread Husain, Yavar
Solr 1.4 is doing great with respect to indexing on a dedicated physical server 
(Windows Server 2008). Indexing around 1 million full-text documents (around 
4 GB in size) takes around 20 minutes with heap size = 512M - 1G and 4 GB RAM.

However, using Solr on a VM with 4 GB RAM, it took 50 minutes to index the 
first time. Note that there are no network delays and no RAM issues. When I 
increased the RAM to 8 GB and increased the heap size, the indexing time 
increased to 2 hours. That was really strange. Note that except for SQL Server 
there is no other process running. However, I have not checked file I/O; could 
that be a bottleneck? Does Solr have any issues running in a virtualized 
environment?

I read a paper today by Brian & Harry, "ON THE RESPONSE TIME OF A SOLR SEARCH 
ENGINE IN A VIRTUALIZED ENVIRONMENT", and they claim that performance 
deteriorates when RAM is increased while Solr runs on a VM, but that is with 
respect to query times, not indexing times.

I am a bit confused as to why it took longer on the VM when I repeated the 
same test a second time with increased heap size and RAM.