Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-15 Thread Dc Tech
Thank you Jan and Charlie. 

I should say that in terms of posting to the community regarding Elastic vs
Solr - this is probably the most civil and helpful community that I have been a
part of, and your answers have only reinforced that notion!

Thank you for your responses. I am glad to hear that both can do most of it, 
which was my gut feeling as well. 

Charlie, to your point - the team probably feels that Elastic is easier to get
started with, hence the preference, as well as the hosting options (with the
caveats you noted). I agree with you completely that tech is not the real issue.

Jan, I agree with the points you made on team skills. On our previous
proprietary engine, that was in fact the biggest issue: the engine was
powerful enough and had good references, but we were not able to exploit
it to good effect.

Thank you again. 

> 
> On Jan 15, 2020, at 5:10 AM, Jan Høydahl  wrote:
> 
> Hi,
> 
> Choosing the solr community mailing list to ask advice on whether to choose
> ES - you already know what to expect, no?
> More often than not the choice comes down to policy, standardization, what 
> skills you have in the house etc rather than ticking off feature checkboxes.
> Sometimes company values also may drive a choice, i.e. Solr is 100% Apache 
> and not open core, which may matter if you plan to get involved in the 
> community, and contribute features or patches.
> 
> However, if I were in your shoes as architect to evaluate tech stack, and 
> there was not a clear choice based on the above, I’d do what projects 
> normally do, to ask yourself what you really need from the engine. Maybe you 
> have some features in your requirement list that makes one a much better 
> choice over the other. Or maybe after that exercise you are still wondering 
> what to choose, in which case you just follow your gut feeling and make a 
> choice :)
> 
> Jan
> 
>> On 15 Jan 2020 at 10:07, Charlie Hull  wrote:
>> 
>>> On 15/01/2020 04:02, Dc Tech wrote:
>>> I am a SOLR fan and had implemented it in our company over 10 years ago.
>>> I moved away from that role and the new search team in the meanwhile
>>> implemented a proprietary (and expensive) nosql style search engine. That
>>> project did not go well, and now I am back on the project and reviewing the
>>> technology stack.
>>> 
>>> Some of the team think that ElasticSearch could be a good option,
>>> especially since we can easily get hosted versions with AWS where we have
>>> all the contractual stuff sorted out.
>> You can, but you should be aware that:
>> 1. Amazon's hosted Elasticsearch isn't great, often lags behind the current 
>> version, doesn't allow plugins etc.
>> 2. Amazon and Elastic are currently engaged in legal battles over who is
>> the most open sourcey, who allegedly copied code that was 'open' but
>> commercially licensed, who would like to capture the hosted search
>> market... not sure how this will pan out (Google for details)
>> 3. You can also buy fully hosted Solr from several places.
>>> While SOLR definitely seems more advanced (LTR, streaming expressions,
>>> graph, and all the knobs and dials for relevancy tuning), Elastic may be
>>> sufficient for our needs. It does not seem to have LTR out of the box but
>>> the relevancy tuning knobs and dials seem to be similar to what SOLR has.
>> Yes, they're basically the same under the hood (unsurprising as they're both 
>> based on Lucene). If you need LTR there's an ES plugin for that (disclaimer, 
>> my new employer built and maintains it: 
>> https://github.com/o19s/elasticsearch-learning-to-rank). I've lost track of
>> the number of times I've been asked 'Elasticsearch or Solr, which should I
>> choose?' and my current thoughts are:
>> 
>> 1. Don't switch from one to the other for the sake of it.  Switching search 
>> engines rarely addresses underlying issues (content quality, team skills, 
>> relevance tuning methodology)
>> 2. Elasticsearch is easier to get started with, but at some point you'll 
>> need to learn how it all works
>> 3. Solr is harder to get started with, but you'll know more about how it all 
>> works earlier
>> 4. Both can be used for most search projects, most features are the same, 
>> both can scale.
>> 5. Lots of Elasticsearch projects (and developers) are focused on logs, 
>> which is often not really a 'search' project.
>> 
>>> 
>>> The corpus size is not a challenge - we have about one million documents,
>>> of which about half have full text

Coming back to search after some time... SOLR or Elastic for text search?

2020-01-14 Thread Dc Tech
I am a SOLR fan and had implemented it in our company over 10 years ago.
I moved away from that role and the new search team in the meanwhile
implemented a proprietary (and expensive) nosql style search engine. That
project did not go well, and now I am back on the project and reviewing the
technology stack.

Some of the team think that ElasticSearch could be a good option,
especially since we can easily get hosted versions with AWS where we have
all the contractual stuff sorted out.

While SOLR definitely seems more advanced (LTR, streaming expressions,
graph, and all the knobs and dials for relevancy tuning), Elastic may be
sufficient for our needs. It does not seem to have LTR out of the box but
the relevancy tuning knobs and dials seem to be similar to what SOLR has.

The corpus size is not a challenge - we have about one million documents,
of which about half have full text, while the rest are simpler (e.g. company
directory entries).
The query volumes are also quite low (max 5/second at peak).
We have already implemented the content ingestion and processing pipelines
in Python and Spark, so most of the data will be pushed in using APIs.
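
A rough sketch of that kind of push, for concreteness: Solr's update endpoint
accepts a JSON array of documents. The core name and field names below are
placeholders, not the actual schema.

import requests

# Hypothetical core and fields -- adjust to the real schema.
SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"

docs = [
    {"id": "doc-1", "title": "Annual report", "body": "full text ..."},
    {"id": "doc-2", "title": "Directory entry", "dept": "Finance"},
]

# Post the batch and commit; a real pipeline would commit far less often.
resp = requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"})
resp.raise_for_status()
print(resp.json()["responseHeader"]["status"])  # 0 on success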

I would really appreciate any guidance from the community !!


Re: Nested documents vs. flattening document structure?

2018-03-06 Thread Dc Tech
Thank you, Erick.
That was my instinct as well.



On Tue, Mar 6, 2018 at 10:05 AM, Erick Erickson  wrote:

> Flattening the nested documents is usually preferred if at all
> possible. Nested documents do, indeed, have a series of restrictions
> that often make them harder to work with than flattened docs.
>
> Best,
> Erick
>
> On Tue, Mar 6, 2018 at 6:48 AM, Dc Tech  wrote:
> > We are evaluating using nested documents vs. simply flattening the
> > document.
> >
> > Looking through the documentation, it is not very clear to me if the nested
> > documents are fully mature, and support the full richness of SOLR
> > (streaming, mature faceting) etc...
> >
> > Any opinions or guidance on that?
> >
> >
> > For *flattening*, we are thinking of setting up three groups of fields:
> > 1. Fields for search - 3-4 groups of fields that glom together the document
> > fields in order of boosting priority (e.g. f1 has just the title, f2 has
> > title+authors)
> > 2. Fields for faceting if needed
> > 3. Fields for display (or the original document fields), e.g.
> > author_name|author_unique_id...
>


Nested documents vs. flattening document structure?

2018-03-06 Thread Dc Tech
We are evaluating using nested documents vs. simply flattening the document.

Looking through the documentation, it is not very clear to me whether nested
documents are fully mature and support the full richness of SOLR
(streaming, mature faceting), etc.

Any opinions or guidance on that?


For *flattening*, we are thinking of setting up three groups of fields:
1. Fields for search - 3-4 groups of fields that glom together the document
fields in order of boosting priority (e.g. f1 has just the title, f2 has
title+authors)
2. Fields for faceting if needed
3. Fields for display (or the original document fields), e.g.
author_name|author_unique_id...
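
A small sketch of how such a flattened document could be assembled before
indexing, following the three groups above (the f1/f2 names and the author
structure are illustrative only):

# Illustrative flattening: glom source fields into boost-priority search
# fields, separate facet fields, and pipe-delimited display fields.
def flatten(doc):
    authors = doc.get("authors", [])
    names = [a["name"] for a in authors]
    return {
        "id": doc["id"],
        "f1": doc["title"],                      # search group: title only
        "f2": " ".join([doc["title"]] + names),  # search group: title + authors
        "author_facet": names,                   # facet group
        "author_display": [f"{a['name']}|{a['id']}" for a in authors],  # display group
    }

print(flatten({"id": "1", "title": "Solr in Action",
               "authors": [{"name": "T. Grainger", "id": "a42"}]}))

At query time the groups can then be weighted in order, e.g. with edismax and
something like qf=f1^4 f2^2.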


Re: Boost parameter with query function - how to pass in complex params?

2013-04-07 Thread dc tech
Yonik:
Pasted the wrong URL as I was trying various things.

It now works with OR:
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota%20OR%20honda&debug=true

See dumps below.

Many thanks.


INPUT

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">882</int>
  <lst name="params">
    <str name="boostq">toyota OR honda</str>
    <str name="fl">text,score</str>
    <str name="q">suv</str>
    <str name="boost">query($boostq,1)</str>
    <str name="debug">true</str>
    <str name="defType">edismax</str>
  </lst>
</lst>

DEBUG:

<str name="rawquerystring">suv</str>
<str name="querystring">suv</str>
<str name="parsedquery">BoostedQuery(boost(+(text:suv),query(text:toyota text:honda,def=1.0)))</str>
<str name="parsedquery_toString">boost(+(text:suv),query(text:toyota text:honda,def=1.0))</str>

On Sun, Apr 7, 2013 at 9:07 AM, Yonik Seeley  wrote:

> On Sun, Apr 7, 2013 at 8:39 AM, dc tech  wrote:
> > Yonik,
> > Many thanks.
> > The OR is still not working... here is the full URL
> > 1. Honda or Toyota individually work
> >
> http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=honda
> >
> http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota
> > I can see the scores increasing on the matching models.
> >
> > 2. But the OR does not work
> >
> http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota%20or%20honda
>
> I still see a lowercase "or" in there that should be uppercase.
>
> You can also add debug=query to see exactly what query is generated.
>
> -Yonik
> http://lucidworks.com
>


FYI - Excel to generate schema and SolrConfig

2013-04-07 Thread dc tech
To minimize typing when setting up a SOLR schema or config, I created a
simple Excel workbook that generates much of the boilerplate.

Please feel free to use it if you find it useful.


solr_schema_shared.xlsx
Description: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet


Re: Boost parameter with query function - how to pass in complex params?

2013-04-07 Thread dc tech
Yonik,
Many thanks.
The OR is still not working... here is the full URL
1. Honda or Toyota individually work
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=honda
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota
I can see the scores increasing on the matching models.

2. But the OR does not work
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota%20or%20honda
The scores stay at the baseline suggesting no match on the boostQ.


3. For reference, the bq parameter works fine.

From a use case perspective, the idea was to pass in user preferences into
the boostq, e.g. projects the user has worked on, when matching documents.

On Sat, Apr 6, 2013 at 10:19 AM, Yonik Seeley  wrote:

> On Sat, Apr 6, 2013 at 9:42 AM, dc tech  wrote:
> > See example below
> > 1. Search for SUVs and boost   Honda models
> > q=suv&boost=query({! v='honda'},1)
> >
> > 2. Search for SUVs and boost   Honda OR  toyota model
> >
> > a) Using OR in the query does NOT work
> >q=suv&boost=query({! v='honda or toyota'},1)
>
> The "or" needs to be uppercase "OR".
>
> It might also be easier to compose and read like this:
> q=suv
> boost=query($boostQ)
> boostQ=honda OR toyota
>
> Of course something simpler like this might also serve your primary goal:
> q=+suv (honda OR toyota)^10
>
>
> -Yonik
> http://lucidworks.com
>
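
Yonik's decomposition is also easy to express as request parameters, which
avoids hand-escaping the boost query in the URL; a sketch (host and core as in
the examples above):

import requests

params = {
    "defType": "edismax",
    "q": "suv",
    "boost": "query($boostQ,1)",  # function query; 1 is the score for non-matches
    "boostQ": "honda OR toyota",  # note the uppercase OR
    "fl": "text,score",
    "debug": "query",             # show the generated query, per the advice above
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/cars/select", params=params)
print(resp.json()["debug"]["parsedquery"])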


Boost parameter with query function - how to pass in complex params?

2013-04-06 Thread dc tech
See example below
1. Search for SUVs and boost Honda models
q=suv&boost=query({! v='honda'},1)

2. Search for SUVs and boost Honda OR Toyota models

a) Using OR in the query does NOT work
   q=suv&boost=query({! v='honda or toyota'},1)

b) Using two query functions and summing the boosts DOES work
Works:  q=suv&boost=sum(query({!v='honda'},1),query({!v='toyota'},1))

Any thoughts?


Re: using edismax without velocity

2013-04-06 Thread DC tech
Definitely in the 4.x release. Did you try it and find a problem?



RE: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-03 Thread DC tech
Thanks David - I suppose it is an AWS question, and thank you for the pointers.

As a further input to the MLT question - it does seem that 3.6 behavior is
different from 4.2 - the issue seems to be more in terms of the raw query that
is generated.
I will do some more research and report back with details.

David Parks  wrote:

>Isn't this an AWS security groups question? You should probably post this 
>question on the AWS forums, but for the moment, here's the basic reading 
>material - go set up your EC2 security groups and lock down your systems.
>
>   
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html
>
>If you just want to password protect Solr here are the instructions:
>
>   http://wiki.apache.org/solr/SolrSecurity
>
>But I most certainly would not leave it open to the world even with a password 
>(note that the basic password authentication sends passwords in clear text if 
>you're not using HTTPS, best lock the thing down behind a firewall).
>
>Dave
>
>
>-Original Message-
>From: DC tech [mailto:dctech1...@gmail.com] 
>Sent: Tuesday, April 02, 2013 1:02 PM
>To: solr-user@lucene.apache.org
>Subject: Re: MoreLikeThis - Odd results - what am I doing wrong?
>
>OK - so I have my SOLR instance running on AWS. 
>Any suggestions on how to safely share the link?  Right now, the whole SOLR 
>instance is totally open. 
>
>
>
>Gagandeep singh  wrote:
>
>>say &debugQuery=true&mlt=true and see the scores for the MLT query, not 
>>a sample query. You can use Amazon ec2 to bring up your solr, you 
>>should be able to get a micro instance for free trial.
>>
>>
>>On Mon, Apr 1, 2013 at 5:10 AM, dc tech  wrote:
>>
>>> I did try the raw query against the *simi* field and those seem to 
>>> return results in the order expected.
>>> For instance, Acura MDX has (Large, SUV, 4WD, Luxury) in the simi field.
>>> Running a query with those words against the simi field returns the 
>>> expected models (X5, Audi Q5, etc) and then the subsequent documents 
>>> have decreasing relevance. So the basic query mechanism seems to be fine.
>>>
>>> The issue just seems to be with MoreLikeThis component and handler.
>>> I can post the index on a public SOLR instance - any suggestions? (or 
>>> for
>>> hosting)
>>>
>>>
>>> On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh 
>>> >> >wrote:
>>>
>>> > If you can bring up your solr setup on a public machine then im 
>>> > sure a
>>> lot
>>> > of debugging can be done. Without that, i think what you should 
>>> > look at
>>> is
>>> > the tf-idf scores of the terms like "camry" etc. Usually idf is the
>>> > deciding factor in which results show at the top (tf should be 1 for
>>> > your data).
>>> > Enable &debugQuery=true and look at the explain section to see how the
>>> > score is getting calculated.
>>> >
>>> > You should try giving different boosts to class, type, drive, size 
>>> > to control the results.
>>> >
>>> >
>>> > On Sun, Mar 31, 2013 at 8:52 PM, dc tech  wrote:
>>> >
>>> >> I am running some experiments on more like this and the results 
>>> >> seem rather odd - I am doing something wrong but just cannot figure out 
>>> >> what.
>>> >> Basically, the similarity results are decent - but not great.
>>> >>
>>> >> *Issue 1  = Quality*
>>> >> Toyota Camry : finds Altima (good) but then next one is Camry 
>>> >> Hybrid whereas it should have found Accord.
>>> >> I have normalized the data into a simi field which has only the 
>>> >> attributes that I care about.
>>> >> Without the simi field, I could not get mlt.qf boosts to work well
>>> enough
>>> >> to return results
>>> >>
>>> >> *Issue 2*
>>> >> Some fields do not work at all. For instance, text+simi (in 
>>> >> mlt.fl)
>>> works
>>> >> whereas just simi does not.
>>> >> So some weirdness that am just not understanding.
>>> >>
>>> >> Would be grateful for your guidance !
>>> >>
>>> >>
>>> >> Here is the setup:
>>> >> *1. SOLR Version*
>>> >> solr-spec 4.2.0.2013.03.06.22.32.13
>>> >> solr-impl 4.2.0 1453694   rmuir - 2013

Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-01 Thread DC tech
OK - so I have my SOLR instance running on AWS. 
Any suggestions on how to safely share the link?  Right now, the whole SOLR 
instance is totally open. 



Gagandeep singh  wrote:

>say &debugQuery=true&mlt=true and see the scores for the MLT query, not a
>sample query. You can use Amazon ec2 to bring up your solr, you should be
>able to get a micro instance for free trial.
>
>
>On Mon, Apr 1, 2013 at 5:10 AM, dc tech  wrote:
>
>> I did try the raw query against the *simi* field and those seem to return
>> results in the order expected.
>> For instance, Acura MDX has (Large, SUV, 4WD, Luxury) in the simi field.
>> Running a query with those words against the simi field returns the
>> expected models (X5, Audi Q5, etc) and then the subsequent documents have
>> decreasing relevance. So the basic query mechanism seems to be fine.
>>
>> The issue just seems to be with MoreLikeThis component and handler.
>> I can post the index on a public SOLR instance - any suggestions? (or for
>> hosting)
>>
>>
>> On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh > >wrote:
>>
>> > If you can bring up your solr setup on a public machine then im sure a
>> lot
>> > of debugging can be done. Without that, i think what you should look at
>> is
>> > the tf-idf scores of the terms like "camry" etc. Usually idf is the
>> > deciding factor in which results show at the top (tf should be 1 for
>> > your data).
>> > Enable &debugQuery=true and look at the explain section to see how the
>> > score is getting calculated.
>> >
>> > You should try giving different boosts to class, type, drive, size to
>> > control the results.
>> >
>> >
>> > On Sun, Mar 31, 2013 at 8:52 PM, dc tech  wrote:
>> >
>> >> I am running some experiments on more like this and the results seem
>> >> rather odd - I am doing something wrong but just cannot figure out what.
>> >> Basically, the similarity results are decent - but not great.
>> >>
>> >> *Issue 1  = Quality*
>> >> Toyota Camry : finds Altima (good) but then next one is Camry Hybrid
>> >> whereas it should have found Accord.
>> >> I have normalized the data into a simi field which has only the
>> >> attributes that I care about.
>> >> Without the simi field, I could not get mlt.qf boosts to work well
>> enough
>> >> to return results
>> >>
>> >> *Issue 2*
>> >> Some fields do not work at all. For instance, text+simi (in mlt.fl)
>> works
>> >> whereas just simi does not.
>> >> So some weirdness that am just not understanding.
>> >>
>> >> Would be grateful for your guidance !
>> >>
>> >>
>> >> Here is the setup:
>> >> *1. SOLR Version*
>> >> solr-spec 4.2.0.2013.03.06.22.32.13
>> >> solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
>> >> lucene-spec 4.2.0
>> >> lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29
>> >>
>> >> *2. Machine Information*
>> >> Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
>> >> 19.0-b09)
>> >> Windows 7 Home 64 Bit with 4 GB RAM
>> >>
>> >> *3. Sample Data *
>> >> I created this 'dummy' data of cars  - the idea being that these would
>> be
>> >> sufficient and simple to generate similarity and understand how it would
>> >> work.
>> >> There are 181 rows in the data set (I have attached it for reference in
>> >> CSV format)
>> >>
>> >> [image: Inline image 1]
>> >>
>> >> *4. SCHEMA*
>> >> *Field Definitions*
>> >>> >> termVectors="true" multiValued="false"/>
>> >>> >> termVectors="true" multiValued="false"/>
>> >>> >> termVectors="true" multiValued="false"/>
>> >>> >> termVectors="true" multiValued="false"/>
>> >>> >> termVectors="true" multiValued="false"/>
>> >>> >> termVectors="true" multiValued="false"/>
>> >>> stored="true"
>> >> termVectors="true" multiValued="true"/>
>> >>> >> termVectors="true" multiValued="false"/>
>> >> *
>>

Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread dc tech
I did try the raw query against the *simi* field and those seem to return
results in the order expected.
For instance, Acura MDX has (Large, SUV, 4WD, Luxury) in the simi field.
Running a query with those words against the simi field returns the
expected models (X5, Audi Q5, etc) and then the subsequent documents have
decreasing relevance. So the basic query mechanism seems to be fine.

The issue just seems to be with MoreLikeThis component and handler.
I can post the index on a public SOLR instance - any suggestions? (or for
hosting)


On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh wrote:

> If you can bring up your solr setup on a public machine then im sure a lot
> of debugging can be done. Without that, i think what you should look at is
> the tf-idf scores of the terms like "camry" etc. Usually idf is the
> deciding factor in which results show at the top (tf should be 1 for your
> data).
> Enable &debugQuery=true and look at the explain section to see how the score
> is getting calculated.
>
> You should try giving different boosts to class, type, drive, size to
> control the results.
>
>
> On Sun, Mar 31, 2013 at 8:52 PM, dc tech  wrote:
>
>> I am running some experiments on more like this and the results seem
>> rather odd - I am doing something wrong but just cannot figure out what.
>> Basically, the similarity results are decent - but not great.
>>
>> *Issue 1  = Quality*
>> Toyota Camry : finds Altima (good) but then next one is Camry Hybrid
>> whereas it should have found Accord.
>> I have normalized the data into a simi field which has only the
>> attributes that I care about.
>> Without the simi field, I could not get mlt.qf boosts to work well enough
>> to return results
>>
>> *Issue 2*
>> Some fields do not work at all. For instance, text+simi (in mlt.fl) works
>> whereas just simi does not.
>> So some weirdness that am just not understanding.
>>
>> Would be grateful for your guidance !
>>
>>
>> Here is the setup:
>> *1. SOLR Version*
>> solr-spec 4.2.0.2013.03.06.22.32.13
>> solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
>> lucene-spec 4.2.0
>> lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29
>>
>> *2. Machine Information*
>> Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
>> 19.0-b09)
>> Windows 7 Home 64 Bit with 4 GB RAM
>>
>> *3. Sample Data *
>> I created this 'dummy' data of cars  - the idea being that these would be
>> sufficient and simple to generate similarity and understand how it would
>> work.
>> There are 181 rows in the data set (I have attached it for reference in
>> CSV format)
>>
>> [image: Inline image 1]
>>
>> *4. SCHEMA*
>> *Field Definitions*
>> [eight field definitions mangled by the archive, all with
>> termVectors="true"; seven single-valued attribute fields plus one
>> multiValued field]
>> *Copy Fields*
>> [copyField directives mangled by the archive: the attribute fields are
>> copied into a catch-all "text" field, and make, class, size, and drive
>> are copied into the "simi" field]
>>
>> Note that the "simi" field ends up with values like make, class, size
>> and drive:
>> - Luxury SUV 4WD Large
>> - Standard Sedan Front Family
>>
>>
>> *5. MLT Setup*
>> a. mlt.FL  = *text* QF=*text*  Works but results are obviously not good
>> (make is not a good similarity indicator)
>>
>> http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=text&mlt.qf=text
>>
>> b. mlt.FL  = *simi* QF=*simi*  Does not work at all (0 results)
>>
>> http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi&mlt.qf=simi
>>
>> c.  mlt.FL  = *simi,text * QF=*simi^10 text^.1*   Works with decent
>> results in most cases
>>
>> http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi,text&mlt.qf=simi
>> ^10%20text^.01
>> Works for getting similarity for Acura MDX (Luxury SUV 4WD Large)
>> But for Toyota Camry - it finds hybrid family cars (Prius) ahead of Honda.
>>
>>
>> *
>> *
>>
>>
>>
>>
>>
>>
>>
>>
>


MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread dc tech
I am running some experiments on more like this and the results seem rather
odd - I am doing something wrong but just cannot figure out what.
Basically, the similarity results are decent - but not great.

*Issue 1  = Quality*
Toyota Camry : finds Altima (good) but then next one is Camry Hybrid
whereas it should have found Accord.
I have normalized the data into a simi field which has only the attributes
that I care about.
Without the simi field, I could not get mlt.qf boosts to work well enough
to return results

*Issue 2*
Some fields do not work at all. For instance, text+simi (in mlt.fl) works
whereas just simi does not.
So some weirdness that am just not understanding.

Would be grateful for your guidance !


Here is the setup:
*1. SOLR Version*
solr-spec 4.2.0.2013.03.06.22.32.13
solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
lucene-spec 4.2.0
lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29

*2. Machine Information*
Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09)
Windows 7 Home 64 Bit with 4 GB RAM

*3. Sample Data *
I created this 'dummy' data of cars  - the idea being that these would be
sufficient and simple to generate similarity and understand how it would
work.
There are 181 rows in the data set (I have attached it for reference in CSV
format)

[image: Inline image 1]

*4. SCHEMA*
*Field Definitions*
[eight field definitions mangled by the archive, all indexed and stored with
termVectors="true"; seven single-valued attribute fields (make, model, class,
type, drive, size, comment) plus a multiValued "simi" field]

*Copy Fields*
[copyField directives mangled by the archive: the attribute fields are copied
into a catch-all "text" field, and make, class, size, and drive are copied
into the "simi" field]

Note that the "simi" field ends up with values like make, class, size and
drive:
- Luxury SUV 4WD Large
- Standard Sedan Front Family


*5. MLT Setup*
a. mlt.fl = *text*, mlt.qf = *text* -- works, but results are obviously not
good (make is not a good similarity indicator)
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=text&mlt.qf=text

b. mlt.fl = *simi*, mlt.qf = *simi* -- does not work at all (0 results)
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi&mlt.qf=simi

c. mlt.fl = *simi,text*, mlt.qf = *simi^10 text^.01* -- works with decent
results in most cases
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi,text&mlt.qf=simi
^10%20text^.01
Works for getting similarity for Acura MDX (Luxury SUV 4WD Large)
But for Toyota Camry - it finds hybrid family cars (Prius) ahead of Honda.
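
For reference, setup (c) expressed as request parameters rather than a
hand-built URL -- a sketch against the same host/core as the examples above:

import requests

params = {
    "q": "id:2",                     # seed document (Acura MDX)
    "mlt": "true",
    "mlt.fl": "simi,text",           # fields mined for "interesting" terms
    "mlt.qf": "simi^10 text^0.01",   # weight simi far above raw text
    "fl": "text",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/cars/select/", params=params)
# The MLT component keys its results by the seed document's uniqueKey value.
for doc in resp.json()["moreLikeThis"]["2"]["docs"]:
    print(doc["text"])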


id,make,model,class,type,drive,comment,size,size_i
1,Acura ,ILX 2.0L,Luxury,Sedan,Front,,Mini,2
2,Acura ,MDX,Luxury,SUV,4wd,,Large,5
3,Acura ,RDX,Luxury,SUV,4wd,,Small,3
4,Acura ,RLX,Luxury,Sedan,AWD,,Large,5
5,Acura ,TL,Luxury,Sedan,Front,,Family,4
6,Acura ,TSX,Luxury,Sedan,Front,,Small,3
7,Acura ,ZDX,Luxury,SUV,4wd,,Large,5
8,Audi ,A3 2.0T,Luxury,Sedan,AWD,,Mini,2
9,Audi ,A4,Luxury,Sedan,AWD,,Small,3
10,Audi ,A5 2.0T,Luxury,Sedan,AWD,,Family,4
11,Audi ,A6 3.0T,Luxury,Sedan,AWD,,Family,4
12,Audi ,A7,Luxury,Sedan,AWD,,Large,5
13,Audi ,A8,Luxury,Sedan,AWD,,Largest,7
14,Audi ,Allroad,Luxury,Wagon,AWD,,Large,5
15,Audi ,Q5 2.0T,Luxury,SUV,4wd,,Large,5
16,Audi ,Q7,Luxury,SUV,4wd,,Largest,7
17,Audi ,R8,Luxury,Sports,RWD,,Largest,7
18,Audi ,S4,Luxury,Sports,AWD,,Small,3
19,Audi ,TT,Luxury,Coupe,Front,,Mini,2
20,BMW ,135i,Luxury,Sedan,RWD,,Mini,2
21,BMW ,328i,Luxury,Sedan,RWD,,Small,3
22,BMW ,4 Series,Luxury,Sedan,RWD,,Family,4
23,BMW ,535i,Luxury,Sedan,RWD,,Large,5
24,BMW ,6 Series,Luxury,Sedan,RWD,,Very Large,6
25,BMW ,750Li,Luxury,Sedan,RWD,,Largest,7
26,BMW ,X1 xDrive28i (2.0T),Luxury,SUV,4wd,,Mini,2
27,BMW ,X3 xDrive28i (2.0T),Luxury,SUV,4wd,,Small,3
28,BMW ,X5 35i,Luxury,SUV,4wd,,Large,5
29,BMW ,X6,Luxury,SUV,4wd,,Very Large,6
30,BMW ,Z4 sDrive28i,Luxury,Sports,RWD,,Mini,2
31,Buick ,Enclave,High,SUV,4wd,,Large,5
32,Cadillac ,ATS (turbo),Luxury,Sedan,RWD,,Mini,2
33,Cadillac ,CTS (V6),Luxury,Sedan,RWD,,Family,4
34,Cadillac ,Escalade,Luxury,SUV,4wd,,Largest,7
35,Cadillac ,SRX,Luxury,SUV,4wd,,Large,5
36,Cadillac ,XTS,Luxury,Sedan,RWD,,Small,3
37,Chevrolet ,Camaro 2LT (V6),Standard,Sports,RWD,,Small,3
38,Chevrolet ,Colorado,Standard,Pickup,4wd,,Small,3
39,Chevrolet ,Corvette Z06,Standard,Sports,RWD,,Small,3
40,Chevrolet ,Cruze 1LT (1.4T),Standard,Sedan,Front,,Mini,2
41,Chevrolet ,Cruze Eco,Standard,Sedan,Front,,Mini,2
42,Chevrolet ,Cruze LS (1.8),Standard,Sedan,Front,,Mini,2
43,Chevrolet ,Equinox (4-cyl.),Standard,SUV,4wd,,Mini,2
44,Chevrolet ,Express,Standard,Commercial,RWD,,Largest,7
45,Chevrolet ,Impala,Standard,Sedan,Front,,Large,5
46,Chevrolet ,Malibu 1LT (2.5),Standard,Sedan,Front,,Small,3
47,Chevrolet ,Malibu Eco,Standard,Sedan,Front,Hybrid,Small,3
48,Chevrolet ,Silverado 1500 5.3 V8,Standard,Pickup,4wd,,Large,5
49,Chevrolet ,Sonic LTZ (1.4T),Standard,Sedan,Front,,Mini,2
50,Chevrolet ,Suburban,Standard,SUV,4wd,,Largest,7
51,Chevrolet ,Volt,Standard,Sedan,Front,Plugin Hybrid,Small,3
52,Chrysler ,200 (V6),Standard,Sedan,Front,,Small,3
53,Chrysler ,300 C,Standard,Sedan,RWD,,Large,5
54,Chrysler ,Town & Country,Standard,Minivan,Front,,Largest,7
55,Coda ,EV,Standard,Sedan,Front,EV Hybrid,Small,3
56,Ford ,C-M

Re: solr benchmarks

2011-01-03 Thread dc tech
Tri:
What is the volume of content (# of documents) and index size you are
expecting? What about the document complexity in terms of # of fields, what
are you storing in the index, complexity of the queries etc?

We have used SOLR with 10m documents with 1-3 second response times on the
front end. This is with minimal tuning, 4-5 facet fields, large blobs of
content in the index, jRuby on Rails, complex queries, and low load
conditions (hence caches are probably not warmed much).

We have an external search application almost fully powered by SOLR (except
for web crawl), and the response is typically less than 1 second with
about 100k documents. Solr time is probably 100-200 ms of this.

My sense is that SOLR is as fast as it gets and scales very, very well. On
the user group, I have seen references to people using SOLR for 100m
documents or more. It would be useful to get your use case(s).
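
For anyone wanting to collect numbers like these, a crude client-side check
is easy to script; the host, core, and query terms below are placeholders:

import time
import requests

for term in ["camry", "suv", "hybrid sedan"]:
    t0 = time.perf_counter()
    resp = requests.get("http://localhost:8983/solr/mycore/select",
                        params={"q": term, "wt": "json"})
    rt_ms = (time.perf_counter() - t0) * 1000
    qtime = resp.json()["responseHeader"]["QTime"]  # Solr-internal time, in ms
    print(f"{term}: round trip {rt_ms:.0f} ms, Solr QTime {qtime} ms")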





On Mon, Jan 3, 2011 at 10:44 AM, Jak Akdemir  wrote:

> Hi,
> You can find benchmark results but these are not directly based on "index
> size vs. response time"
> http://wiki.apache.org/solr/SolrPerformanceData
>
> On Sat, Jan 1, 2011 at 4:06 AM, Tri Nguyen  wrote:
>
> > Hi,
> >
> > I remember going through some page that had graphs of response times
> based
> > on index size for solr.
> >
> > Anyone know of such pages?
> >
> > Internally, we have some requirements for response times and I'm trying
> to
> > figure out when to shard the index.
> >
> > Thanks,
> >
> > Tri
>


Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
1) I assume you are doing batching interspersed with commits
2) Why do you need sentence-level Lucene docs?
3) Are your custom handlers/parsers a part of the SOLR JVM? I would not be
surprised if you have a memory/connection leak there (or something is not
releasing a resource explicitly)

In general, we have NEVER had a problem in loading Solr.
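
A minimal sketch of point 1 -- batches with interspersed commits -- using
Solr's JSON update format (host and core are placeholders):

import requests

UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def index_in_batches(docs, batch_size=1000, commit_every=10):
    """Send docs in batches; commit every N batches rather than per document."""
    batch, sent = [], 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            requests.post(UPDATE_URL, json=batch).raise_for_status()
            batch, sent = [], sent + 1
            if sent % commit_every == 0:
                requests.post(UPDATE_URL, json={"commit": {}}).raise_for_status()
    if batch:
        requests.post(UPDATE_URL, json=batch).raise_for_status()
    requests.post(UPDATE_URL, json={"commit": {}}).raise_for_status()  # final commit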

On 8/12/10, Rebecca Watson  wrote:
> sorry -- i used the term "documents" too loosely!
>
> 180k scientific articles with between 500-1000 sentences each
> and we index sentence-level index documents
> so i'm guessing about 100 million lucene index documents in total.
>
> an update on my progress:
>
> i used GC settings of:
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
>   -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
> -XX:CMSInitiatingOccupancyFraction=70
>
> which allowed the indexing process to run to 11.5k articles and
> for about 2 hours before I got the same kind of hanging/unresponsive Solr
> with
> this as the tail of the solr logs:
>
> Before GC:
> Statistics for BinaryTreeDictionary:
> 
> Total Free Space: 2416734
> Max   Chunk Size: 2412032
> Number of Blocks: 3
> Av.  Block  Size: 805578
> Tree  Height: 3
> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480:
> [CMS
>
> I also saw (in jconsole) that the number of threads rose from the
> steady 32 used for the
> 2 hours to 72 before Solr finally became unresponsive...
>
> i've got the following GC info params switched on (as many as i could
> find!):
> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>   -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
>   -XX:PrintFLSStatistics=1
>
> with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
> million fairly small
> docs per hour!! this produced an index of about 40GB to give you an
> idea of index
> size...
>
> because i've already got the documents in solr native xml format
> i.e. one file per article each with ...
> i.e. posting each set of sentence docs per article in every LCF file post...
> this means that LCF can throw documents at Solr very fast and i think
> i'm
> breaking it GC-wise.
>
> i'm going to try adding in System.gc() calls to see if this runs ok
> (albeit slower)...
> otherwise i'm pretty much at a loss as to what could be causing this GC
> issue/
> solr hanging if it's not a GC issue...
>
> thanks :)
>
> bec
>
> On 12 August 2010 21:42, dc tech  wrote:
>> I am a little confused - how did 180k documents become 100m index
>> documents?
>> We have over 20 indices (for different content sets), one with 5m
>> documents (about a couple of pages each) and another with 100k+ docs.
>> We can index the 5m collection in a couple of days (limitation is in
>> the source) which is 100k documents an hour without breaking a sweat.
>>
>>
>>
>> On 8/12/10, Rebecca Watson  wrote:
>>> Hi,
>>>
>>> When indexing large amounts of data I hit a problem whereby Solr
>>> becomes unresponsive
>>> and doesn't recover (even when left overnight!). I think i've hit some
>>> GC problems/tuning
>>> is required of GC and I wanted to know if anyone has ever hit this
>>> problem.
>>> I can replicate this error (albeit taking longer to do so) using
>>> Solr/Lucene analysers
>>> only so I thought other people might have hit this issue before over
>>> large data sets
>>>
>>> Background on my problem follows -- but I guess my main question is --
>>> can
>>> Solr
>>> become so overwhelmed by update posts that it becomes completely
>>> unresponsive??
>>>
>>> Right now I think the problem is that the java GC is hanging but I've
>>> been working
>>> on this all week and it took a while to figure out it might be
>>> GC-based / wasn't a
>>> direct result of my custom analysers so i'd appreciate any advice anyone
>>> has
>>> about indexing large document collections.
>>>
>>> I also have a second questions for those in the know -- do we have a
>>> chance
>>> of indexing/searching over our large dataset with what little hardware
>>> we already
>>> have available??
>>>
>>> thanks in advance :)
>>>
>>> bec
>>>
>>> a bit of background: ---
>>>
>>> I've got a large collection of articles we want to index/search over
>>&g

Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
I am a little confused - how did 180k documents become 100m index documents?
We have over 20 indices (for different content sets), one with 5m
documents (about a couple of pages each) and another with 100k+ docs.
We can index the 5m collection in a couple of days (limitation is in
the source) which is 100k documents an hour without breaking a sweat.



On 8/12/10, Rebecca Watson  wrote:
> Hi,
>
> When indexing large amounts of data I hit a problem whereby Solr
> becomes unresponsive
> and doesn't recover (even when left overnight!). I think i've hit some
> GC problems/tuning
> is required of GC and I wanted to know if anyone has ever hit this problem.
> I can replicate this error (albeit taking longer to do so) using
> Solr/Lucene analysers
> only so I thought other people might have hit this issue before over
> large data sets
>
> Background on my problem follows -- but I guess my main question is -- can
> Solr
> become so overwhelmed by update posts that it becomes completely
> unresponsive??
>
> Right now I think the problem is that the java GC is hanging but I've
> been working
> on this all week and it took a while to figure out it might be
> GC-based / wasn't a
> direct result of my custom analysers so i'd appreciate any advice anyone has
> about indexing large document collections.
>
> I also have a second questions for those in the know -- do we have a chance
> of indexing/searching over our large dataset with what little hardware
> we already
> have available??
>
> thanks in advance :)
>
> bec
>
> a bit of background: ---
>
> I've got a large collection of articles we want to index/search over
> -- about 180k
> in total. Each article has say 500-1000 sentences and each sentence has
> about
> 15 fields, many of which are multi-valued and we store most fields as well
> for
> display/highlighting purposes. So I'd guess over 100 million index
> documents.
>
> In our small test collection of 700 articles this results in a single index
> of
> about 13GB.
>
> Our pipeline processes PDF files through to Solr native xml which we call
> "index.xml" files i.e. in ... format ready to post straight to
> Solr's
> update handler.
>
> We create the index.xml files as we pull in information from
> a few sources and creation of these files from their original PDF form is
> farmed out across a grid and is quite time-consuming so we distribute this
> process rather than creating index.xml files on the fly...
>
> We do a lot of linguistic processing and to enable search functionality
> of our resulting terms requires analysers that split terms/ join terms
> together
> i.e. custom analysers that perform string operations and are quite
> time-consuming/
> have large overhead compared to most analysers (they take approx
> 20-30% more time
> and use twice as many short-lived objects than the "text" field type).
>
> Right now i'm working on my new Imac:
> quad-core 2.8 GHz intel Core i7
> 16 GB 1067 MHz DDR3 RAM
> 2TB hard-drive (about half free)
> Version 10.6.4 OSX
>
> Production environment:
> 2 linux boxes each with:
> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
> 16GB RAM
>
> I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
> right now).
>
> I setup Solr to use autocommit as we'll have several document collections /
> post
> to Solr from different data sets:
>
>   <autoCommit>
>     <maxDocs>50</maxDocs>
>     <maxTime>90</maxTime>
>   </autoCommit>
>
> I also have
>   <useCompoundFile>false</useCompoundFile>
>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>   <mergeFactor>10</mergeFactor>
> -
>
> *** First question:
> Has anyone else found that Solr hangs/becomes unresponsive after too
> many documents are indexed at once i.e. Solr can't keep up with the post
> rate?
>
> I've got LCF crawling my local test set (file system connection
> required only) and
> posting documents to Solr using 6GB of RAM. As I said above, these documents
> are in native Solr XML format () with one file per article so
> each
>  contains all the sentence-level documents for the article.
>
> With LCF I post about 2.5/3k articles (files) per hour -- so about
> 2.5k*500 /3600 =
> 350 docs per second post-rate -- is this normal/expected??
>
> Eventually, after about 3000 files (an hour or so) Solr starts to
> hang/becomes
> unresponsive and with Jconsole/GC logging I can see that the Old-Gen space
> is
> about 90% full and the following is the end of the solr log file-- where you
> can see GC has been called:
> --
> 3012.290: [GC Before GC:
> Statistics for BinaryTreeDictionary:
> 
> Total Free Space: 53349392
> Max   Chunk Size: 3200168
> Number of Blocks: 66
> Av.  Block  Size: 808324
> Tree  Height: 13
> Before GC:
> Statistics for BinaryTreeDictionary:
> 
> Total Free Space: 0
> Max   Chunk Size: 0
> Number of Blocks: 0
> Tree  Height: 0
> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
> 0.0769802 secs]3012.367: [CMS
> 

Re: Facet Fields - ID vs. Display Value

2010-08-09 Thread dc tech
I think it depends on what you need:
1) Simple, unique category - facet on the display value
2) Categories may be duplicates from a display perspective (e.g. authors) -
store display#id in the facet field but show only the display part
3) Internationalization requirements - store the ID but have the UI pull and
display the translated labels
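
For option 2, splitting the stored value back apart is trivial in the UI
layer; a sketch (the field name and "#" separator are just conventions):

# Solr returns facet_fields as a flat [value, count, value, count, ...] list.
facet_values = ["Smith, J.#a17", 12, "Smith, J.#a99", 7, "Tanaka, K.#b03", 3]

for value, count in zip(facet_values[::2], facet_values[1::2]):
    display, _, author_id = value.rpartition("#")
    # Show only the display part; keep the full value for the filter query.
    print(f'{display} ({count})  -> fq=author_facet:"{value}"')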

On 8/9/10, Frank A  wrote:
> What I meant (which I realize now wasn't very clear) was if I have
> something like categoryID and categorylabel - is the normal practice
> to define categoryID as the facet field and then have the UI layer
> display the label?  Or would it be normal to directly use
> categorylabel as the facet field?
>
>
>
> On Mon, Aug 9, 2010 at 6:01 PM, Otis Gospodnetic
>  wrote:
>> Hi Frank,
>>
>> I'm not sure what you mean by that.
>> If the question is about what should be shown in the UI, it should be
>> something
>> pretty and human-readable, such as the original facet string value,
>> assuming it
>> was nice and clean.
>>
>> Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>>
>> - Original Message 
>>> From: Frank A 
>>> To: solr-user@lucene.apache.org
>>> Sent: Mon, August 9, 2010 5:19:57 PM
>>> Subject: Facet Fields - ID vs. Display Value
>>>
>>> Is there a general best practice on whether facet fields should be on
>>> "IDs"  or "Display values"?
>>>
>>> -Frank
>>>
>>
>

-- 
Sent from my mobile device


Re: want to display elevated results on my display result screen differently.

2010-08-03 Thread dc tech
Have you looked at the relevance scores? I would speculate that elevated
matches would have a constant, high score.

On 8/3/10, Vishal.Arora  wrote:
>
> Suppose I have an elevate.xml file and I elevate the IDs Artist:11650 and
> Artist:510 when I search for corgan.
> This is the elevate file:
> <elevate>
>  <query text="corgan">
>   <doc id="Artist:11650" />
>   <doc id="Artist:510" />
>  </query>
> </elevate>
>
>
> Is there any way (query parameter) which gives us a clue as to which IDs are
> elevated when the actual search is done for corgan?
>
> When we search, the result XML structure is the same as a normal search
> without elevation. I want to display elevated results on my display result
> screen differently.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Show-elevated-Result-Differently-tp1002081p1018879.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

-- 
Sent from my mobile device


Re: Solr searching performance issues, using large documents

2010-07-29 Thread dc tech
Are you storing the entire log file text in SOLR? That's almost 3GB of
text that you are storing in the index. A few things to try:
1) Is this first-time performance, or on repeat queries with the same fields?
2) Optimize the index and test performance again.
3) Index without storing the text and see what the performance looks like.


On 7/29/10, Peter Spam  wrote:
> Any ideas?  I've got 5000 documents with an average size of 850k each, and
> it sometimes takes 2 minutes for a query to come back when highlighting is
> turned on!  Help!
>
>
> -Pete
>
> On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
>
>> From the mailing list archive, Koji wrote:
>>
>>> 1. Provide another field for highlighting and use copyField to copy
>>> plainText to the highlighting field.
>>
>> and Lance wrote:
>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html
>>
>>> If you want to highlight field X, doing the
>>> termOffsets/termPositions/termVectors will make highlighting that field
>>> faster. You should make a separate field and apply these options to that
>>> field.
>>>
>>> Now: doing a copyfield adds a "value" to a multiValued field. For a text
>>> field, you get a multi-valued text field. You should only copy one value
>>> to the highlighted field, so just copyField the document to your special
>>> field. To enforce this, I would add multiValued="false" to that field,
>>> just to avoid mistakes.
>>>
>>> So, all_text should be indexed without the term* attributes, and should
>>> not be stored. Then your document stored in a separate field that you use
>>> for highlighting and has the term* attributes.
>>
>> I've been experimenting with this, and here's what I've tried:
>>
>>   > multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>>   > multiValued="true" />
>>   
>>
>> ... but it's still very slow (10+ seconds).  Why is it better to have two
>> fields (one indexed but not stored, and the other not indexed but stored)
>> rather than just one field that's both indexed and stored?
>>
>>
>> From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors
>>
>>> If you aren't always using all the stored fields, then enabling lazy
>>> field loading can be a huge boon, especially if compressed fields are
>>> used.
>>
>> What does this mean?  How do you load a field lazily?
>>
>> Thanks for your time, guys - this has started to become frustrating, since
>> it works so well, but is very slow!
>>
>>
>> -Pete
>>
>> On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:
>>
>>> Data set: About 4,000 log files (will eventually grow to millions).
>>> Average log file is 850k.  Largest log file (so far) is about 70MB.
>>>
>>> Problem: When I search for common terms, the query time goes from under
>>> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
>>> disable highlighting, performance improves a lot, but is still slow for
>>> some queries (7 seconds).  Thanks in advance for any ideas!
>>>
>>>
>>> -Peter
>>>
>>>
>>> -
>>>
>>> 4GB RAM server
>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>
>>> -
>>>
>>> schema.xml changes:
>>>
>>> [schema.xml changes mangled by the archive: a text field type whose
>>> analyzer includes a WordDelimiterFilter with generateNumberParts="0"
>>> catenateWords="0" catenateNumbers="0" catenateAll="0"
>>> splitOnCaseChange="0"; a "body" field with termVectors="true"
>>> termPositions="true" termOffsets="true"; a date field with default="NOW";
>>> several single-valued metadata fields (filename, version, device,
>>> first2md5, filesize, ckey per the query below); and the default search
>>> field set to body]
>>>
>>> -
>>>
>>> solrconfig.xml changes:
>>>
>>>   <maxFieldLength>2147483647</maxFieldLength>
>>>   [a second setting mangled by the archive: value 128]
>>>
>>> -
>>>
>>> The query:
>>>
>>> rowStr = "&rows=10"
>>> facet =
>>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>> regexv = "(?m)^.*\n.*\n.*$"
>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/,
>>> '').gsub(/([:~!<>="])/,'\1') + fuzzy + minLogSizeStr)
>>>
>>> thequery = '/solr/select?timeAllow

Re: Solr using 1500 threads - is that normal?

2010-07-28 Thread dc tech
1,500 threads seems extreme by any standards so there is something
happening in your install. Even with appservers for web apps,
typically 100 would be a fair # of threads.


On 7/28/10, Christos Constantinou  wrote:
> Hi,
>
> Solr seems to be crashing after a JVM exception that new threads cannot be
> created. I am writing in hope of advice from someone that has experienced
> this before. The exception that is causing the problem is:
>
> Exception in thread "btpool0-5" java.lang.OutOfMemoryError: unable to create
> new native thread
>
> The memory that is allocated to Solr is 3072MB, which should be enough
> memory for a ~6GB data set. The documents are not big either, they have
> around 10 fields of which only one stores large text ranging between 1k-50k.
>
> The top command at the time of the crash shows Solr using around 1500
> threads, which I assume it is not normal. Could it be that the threads are
> crashing one by one and new ones are created to cope with the queries?
>
> In the log file, right after the the exception, there are several thousand
> commits before the server stalls completely. Normally, the log file would
> report 20-30 document existence queries per second, then 1 commit per 5-30
> seconds, and some more infrequent faceted document searches on the data.
> However after the exception, there are only commits until the end of the log
> file.
>
> I am wondering if anyone has experienced this before or if it is some sort
> of known bug from Solr 1.4? Is there a way to increase the details of the
> exception in the logfile?
>
> I am attaching the output of a grep Exception command on the logfile.
>
> Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:19:32 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:20:18 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:20:48 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:22:43 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:28:50 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:33:19 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:35:08 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:35:58 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:35:59 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:44:31 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=2, try again later.
> Jul 28, 2010 8:51:49 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrEx

Re: Performance issues when querying on large documents

2010-07-24 Thread dc tech
Are you storing the full 1,000 pages in the index? If so, that is
probably not helping either.

On 7/23/10, ahammad  wrote:
>
> Hello,
>
> I have an index with lots of different types of documents. One of those
> types basically contains extracts of PDF docs. Some of those PDFs can have
> 1000+ pages, so there would be a lot of stuff to search through.
>
> I am experiencing really terrible performance when querying. My whole index
> has about 270k documents, but less than 1000 of those are the PDF extracts.
> The slow querying occurs when I search only on those PDF extracts (by
> specifying filters), and return 100 results. The 100 results definitely adds
> to the issue, but even cutting that down can be slow.
>
> Is there a way to improve querying with such large results? To give an idea,
> querying for a single word can take a little over a minute, which isn't
> really viable for an application that revolves around searching. For now, I
> have limited the results to 20, which makes the query execute in roughly
> 10-15 seconds. However, I would like to have the option of returning 100
> results.
>
> Thanks a lot.
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performance-issues-when-querying-on-large-documents-tp990590p990590.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

-- 
Sent from my mobile device


Re: Personalized Search

2010-05-21 Thread dc tech
Excluding favorited items is an easier problem:
- get the results
- get the exclude list from the db
- scan the results and exclude the items in the list

You'd have to do some code to manage 'holes' in the result list, i.e. fetch
more, etc.

You could marry this with the SOLR batch-based approach to reduce the holes
(see the sketch below):
- Every night, update the item.users field. This can be a simple string type
of field.
- Query with negative criteria, i.e.
   content:search_term AND -users:userid
- Then do the steps outlined earlier
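
A sketch of those steps (host, core, and field names are placeholders):

import requests

def search_excluding_favorites(term, user_id, rows=10):
    # The nightly batch job has written favoriting user ids into the "users"
    # field, so the user's favorited items can be excluded in the query itself.
    params = {
        "q": f"content:{term} AND -users:{user_id}",
        "rows": rows,
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/items/select", params=params)
    return resp.json()["response"]["docs"]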

On 5/21/10, Rih  wrote:
>>
>> - keep the SOLR index independent of bought/like
>
> - have a db table with user prefs on a per item basis
>
>
> I have the same idea this far.
>
> at query time, specify boosts for 'my items' items
>
>
> I believe this works if you want to sort results by faved/not faved. But how
> does it scale if users already favorited/liked hundreds of items? The query
> can be quite long.
>
> Looking forward to your idea.
>
>
>
> On Thu, May 20, 2010 at 6:37 PM, dc tech  wrote:
>
>> Another approach would be to do query time boosts of 'my' items under
>> the assumption that count is limited:
>> - keep the SOLR index independent of bought/like
>> - have a db table with user prefs on a per item basis
>> - at query time, specify boosts for 'my items' items
>>
>> We are planning to do this in the context of document management where
>> documents in 'my (used/favorited ) folders' provide a boost factor
>> to the results.
>>
>>
>>
>> On 5/20/10, findbestopensource  wrote:
>> > Hi Rih,
>> >
>> > You going to include either of the two field "bought" or "like" to per
>> > member/visitor OR a unique field per member / visitor?
>> >
>> > If it's one or two common fields are included then there will not be any
>> > impact in performance. If you want to include unique field then you need
>> to
>> > consider multi value field otherwise you certainly hit the wall.
>> >
>> > Regards
>> > Aditya
>> > www.findbestopensource.com
>> >
>> >
>> >
>> >
>> > On Thu, May 20, 2010 at 12:13 PM, Rih  wrote:
>> >
>> >> Has anybody done personalized search with Solr? I'm thinking of
>> including
>> >> fields such as "bought" or "like" per member/visitor via dynamic fields
>> to
>> >> a
>> >> product search schema. Another option is to have a multi-value field
>> that
>> >> can contain user IDs. What are the possible performance issues with
>> >> this
>> >> setup?
>> >>
>> >> Looking forward to your ideas.
>> >>
>> >> Rih
>> >>
>> >
>>
>> --
>> Sent from my mobile device
>>
>

-- 
Sent from my mobile device


Re: Personalized Search

2010-05-21 Thread dc tech
In our specific case, we would get the user's folders and then do a
function query that provides a boost if the document.folder is in {my
folder list}.

Another approach that will work for our intranet use is to add the
userids in a multi-valued field as others have suggested.



On 5/20/10, MitchK  wrote:
>
> Hi dc,
>
>
>
>> - at query time, specify boosts for 'my items' items
>>
> Do you mean something like document-boost or do you want to include
> something like
> "OR myItemId:100^100"
> ?
>
> Can you tell us how you would specify document-boostings at query-time? Or
> are you querying something like a boolean field (i.e. isFavorite:true^10) or
> a numeric field?
>
> Kind regards
> - Mitch
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Personalized-Search-tp831070p832062.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

-- 
Sent from my mobile device


Re: Personalized Search

2010-05-20 Thread dc tech
Another approach would be to do query time boosts of 'my' items under
the assumption that count is limited:
- keep the SOLR index independent of bought/like
- have a db table with user prefs on a per item basis
- at query time, specify boosts for 'my items' items

We are planning to do this in the context of document management where
documents in 'my (used/favorited) folders' provide a boost factor
to the results.
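
A sketch of that query-time boost using the dismax bq parameter (host, core,
field name, and boost value are placeholders):

import requests

def search_with_folder_boost(term, user_folders, boost=5):
    # Folders come from the per-user prefs table; documents in the user's
    # folders rank higher, but nothing is excluded.
    bq = " OR ".join(f'folder:"{f}"^{boost}' for f in user_folders)
    params = {"defType": "dismax", "q": term, "bq": bq, "wt": "json"}
    resp = requests.get("http://localhost:8983/solr/docs/select", params=params)
    return resp.json()["response"]["docs"]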



On 5/20/10, findbestopensource  wrote:
> Hi Rih,
>
> You going to include either of the two field "bought" or "like" to per
> member/visitor OR a unique field per member / visitor?
>
> If it's one or two common fields are included then there will not be any
> impact in performance. If you want to include unique field then you need to
> consider multi value field otherwise you certainly hit the wall.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
>
>
> On Thu, May 20, 2010 at 12:13 PM, Rih  wrote:
>
>> Has anybody done personalized search with Solr? I'm thinking of including
>> fields such as "bought" or "like" per member/visitor via dynamic fields to
>> a
>> product search schema. Another option is to have a multi-value field that
>> can contain user IDs. What are the possible performance issues with this
>> setup?
>>
>> Looking forward to your ideas.
>>
>> Rih
>>
>

-- 
Sent from my mobile device


SOLR Based Search - Response Times - what do you consider slow or fast?

2010-05-04 Thread dc tech
We are using SOLR in a production setup with a jRuby on Rails front end
with about 20 different instances of SOLR running on heavy duty hardware.
The setup is load balanced front end (jRoR) on a pair of machines and the
SOLR backends on a different machine. We have plenty of memory and CPU and
the machines are not particularly loaded (<5% CPUs). Loads are in the range
of 12,000 to 16,000 searches a day so not a huge number. Our overall
response (front end + SOLR) averages 0.5s to 0.7s with SOLR typically taking
about 100-300 ms.

How does this compare with your experience? Would you say the performance is
good/bad/ugly?


Re: Score cutoff

2010-05-04 Thread dc tech
Michael,
The cutoff filter would be very useful for us as well. We want to use
it for the more-like-this feature, where only the top n similar docs tend
to be really similar.
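
Until such a filter exists inside the engine, a client-side approximation is
straightforward, with the caveat Satish raises below that facet counts will
not reflect the cutoff (host and core are placeholders):

import requests

def search_with_cutoff(query, ratio=0.6, rows=50):
    params = {"q": query, "fl": "id,score", "rows": rows, "wt": "json"}
    resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
    docs = resp.json()["response"]["docs"]
    if not docs:
        return []
    top = docs[0]["score"]  # results arrive sorted by score, descending
    # Keep only docs scoring at least `ratio` of the best hit.
    return [d for d in docs if d["score"] >= ratio * top]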



On 5/4/10, Michael Kuhlmann  wrote:
> On 03.05.2010 23:32, Satish Kumar wrote:
>> Hi,
>>
>> Can someone give clues on how to implement this feature? This is a very
>> important requirement for us, so any help is greatly appreciated.
>>
>
> Hi,
>
> I just implemented exactly this feature. You need to patch Solr to make
> this work.
>
> We at Zalando are planning to set up a technology blog where we'll offer
> such tools, but at the moment this is not done. I can make a patch out
> of my work and send it to you today.
>
> Greetings,
> Michael
>
>> On Tue, Apr 27, 2010 at 5:54 PM, Satish Kumar <
>> satish.kumar.just.d...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> For some of our queries, the top xx (five or so) results are of very high
>>> quality and results after xx are very poor. The difference in score for
>>> the
>>> high quality and poor quality results is high. For example, 3.5 for high
>>> quality and 0.8 for poor quality. We want to exclude results with score
>>> value that is less than 60% or so of the first result. Is there a filter
>>> that does this? If not, can someone please give some hints on how to
>>> implement this (we want to do this as part of solr relevance ranking so
>>> that
>>> the facet counts, etc will be correct).
>>>
>>>
>>> Thanks,
>>> Satish
>>>
>>
>
>

-- 
Sent from my mobile device