Using facets to narrow results with multiword field

2009-12-11 Thread Tomasz Kępski

Hi,

I'm trying to build narrow-your-search functionality using facets. I
have some products and would like to use the brand as a narrowing filter.


I defined two field types and two fields in my schema:

   <fieldType name="brand_string" class="solr.TextField"
              sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.TrimFilterFactory"/>
     </analyzer>
   </fieldType>

   <fieldType name="lower_string" class="solr.TextField"
              sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.TrimFilterFactory"/>
     </analyzer>
   </fieldType>


   <field name="brand" type="brand_string" indexed="true" stored="true"
          default="none"/>
   <field name="lbrand" type="lower_string" indexed="true" stored="false"
          default="none"/>

   <copyField source="brand" dest="lbrand"/>
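
With these definitions a raw value like "GeoMax " should be indexed as a
single token in each field; a sketch of the expected analysis
(KeywordTokenizer keeps the whole value as one token, then the filters apply):

   input:  "GeoMax "
   brand   (keyword tokenizer + trim)              -> "GeoMax"
   lbrand  (keyword tokenizer + lowercase + trim)  -> "geomax"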

I'm using facet.field=lbrand and get good results: e.g. Geomax,
GeoMax, and GEOMAX all fall into geomax. But when I filter
I get strange results:


brand:geomax  gives numFound=0
lbrand:geomax  gives numFound=57 (GEOMAX, GeoMag, Geomag)
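
For clarity, the narrowing request I have in mind looks roughly like this
(an illustrative URL; host, port, and the main query are placeholders):

   http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=lbrand&fq=lbrand:geomax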

How should I redefine brand so that narrowing works correctly?

Tomek


Re: Using facets to narrow results with multiword field

2009-12-11 Thread Tomasz Kępski

Correction:

I'm using facet.field=lbrand and get good results: e.g. Geomag,
GeoMag, and GEOMAG all fall into geomag. But when I filter
I get strange results:


brand:geomag  gives numFound=0
lbrand:geomag  gives numFound=57 (GEOMAG, GeoMag, Geomag)

How should I redefine brand so that narrowing works correctly?


Of course all of the words are the same (only the case differs).

TK


Re: Huge load and long response times during search

2009-11-24 Thread Tomasz Kępski

Hi,


: I'm using Solr (1.4) to search among about 3,500,000 documents. After the
: server kernel was updated to 64-bit, the system started to suffer.

...if the *only* thing that was upgraded was switching the kernel from
32-bit to 64-bit, then perhaps you are getting bitten by Java now using 64-bit
pointers instead of 32-bit pointers, causing a lot more RAM to be eaten up
by the pointers?


it's not something I've done a lot of testing on, but I've heard other
people claim that it can cause some serious problems if you don't actually
need 64-bit pointers for accessing huge heaps.


...that said, you should really double-check what exactly changed
when your server was upgraded ... perhaps the upgrade included a new
filesystem type, or changes to RAID settings, or even hardware changes ...
if your problems started when an upgrade took place, then looking into
what exactly changed during the upgrade should be your first step.


The kernel was the only thing that was changed. There was no hardware
update, and nobody touched the filesystem either. So it is now a 32-bit
Debian with a 64-bit kernel.
I have heard from our admins that the previous kernel had a grsec patch
which regularly killed Java processes with signal 11.
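
If it really is the 64-bit pointers, one thing we may try is compressed
ordinary object pointers; a sketch, assuming a Sun JVM new enough to
support the flag (6u14 or later) and the example Jetty start script:

   java -Xmx3584m -XX:+UseCompressedOops -jar start.jar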


To find out whether Solr alone is the problem, or whether it is the
coexistence of other services on one machine, we are going to move Solr to
another machine (same configuration) which is lightly used (a small PHP app
serving data from memcache, refilled once per hour).


Tom


Re: Boost document base on field length

2009-11-24 Thread Tomasz Kępski

Hi,

I think I'm reading the question differently than Grant -- his suggestion
applies when you are searching in the description field and don't want
documents with shorter descriptions to score higher when the same terms
match the same number of times (the default behavior of lengthNorm).


my understanding is that you want documents that don't have a description
to score lower than documents that do -- and you might be querying against
completely different fields (description might not even be indexed).


in that case there is no easy way to achieve this with just the
description field ... the easy thing to do is to index a boolean
has_description field and then incorporate that into your query (or as
the input to a function query).


You got my point, Hoss. In my case a long description = good value. And
your intuition is amazing ;-) I do have a field that is not used in
search at all (an image URL), but docs with an image have greater value
to me than those without.


I would add two fields then (a boolean for the photo and an int for the
description length), fill them in during indexing, and play with them
during the search -- roughly as sketched below.
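
A minimal sketch of what I mean (the field names has_image and desc_len
are my own, and the bq/bf parameters assume the standard dismax handler):

   <!-- schema.xml: helper fields, filled in at index time -->
   <field name="has_image" type="boolean" indexed="true" stored="false"/>
   <field name="desc_len"  type="int"     indexed="true" stored="false"/>

and then at query time something like:

   qt=dismax&q=camera&bq=has_image:true^2.0&bf=log(linear(desc_len,1,1))

(log of desc_len + 1 rather than log of desc_len, so that a zero-length
description does not blow up the boost function.)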


Thanks,
Tom



Get one document from each category

2009-11-24 Thread Tomasz Kępski

Hi,

I have the following case:

In my index the documents are categorized (category_id - a sortable int
field). I would like to get the three top documents matching the user's
query, BUT each has to be from a different category:


for example, from the returned set (doc_id : category_id):

1:1
2:1
3:1
4:2
5:1
6:2
7:3
8:4

I would like to get docs 1, 4, and 7.
Is that possible without querying 3 times? Often a lot of the docs at the
beginning (more than my limit) are from the same category.
I'm using the PHP Apache Solr client, so I would like to avoid processing
large result sets in my PHP application.
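
A sketch of the kind of single request I'm hoping for; the collapse.*
parameters come from the uncommitted field-collapsing patch (SOLR-236),
so treat the exact names as illustrative only:

   /select?q=...&rows=3&collapse.field=category_id&collapse.max=1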


Tomek


Re: Huge load and long response times during search

2009-11-23 Thread Tomasz Kępski

Hi,

Otis Gospodnetic writes:

Tom,

It looks like the machine might simply be running too many things.
 If the load is around 1 when Solr is not running, and this is a
dual-core server, it shows it's already relatively busy (circa 50% idle).


The server is running PostgreSQL and Apache/PHP as well, but without
Solr its condition is more than good (load usually less than 1; even
during rush hours we observed a 1-minute load average of 0.68).


It is a dual dual-core machine, so a load of 1 means 25%, am I right (4 cores)?

Your caches are not small, so I am guessing you either have to have a relatively big heap, or your heap is not large enough and it's the GC that's causing high CPU load.  


Java starts with -Xmx3584m. Should that be fine for such cache
settings? By the way, I'm wondering whether we need such caches at all. I
checked the query frequency for the last 10 days (~7 unique users): the
most frequent phrase appears ~150 times, and only 11 queries occur more
than 100 times. I did not count cases where the user kept the same query
but went to the next page.


Is it worth keeping such a big cache in this case?


If you are seeing Solr causing lots of IO, that's a sign the box doesn't have 
enough memory for all those servers running comfortably on it.


We do have some free memory to use. The server has 8G of RAM and mostly
uses up to 6G; I haven't seen the swap used yet. I will try to give more
RAM to Java and use smaller caches to see if that helps.
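
To see whether the GC really is the culprit, I will also turn on GC
logging; a sketch using standard Sun HotSpot flags (the heap size is just
our current setting):

   java -Xmx3584m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar start.jar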


Tom




Boost document base on field length

2009-11-23 Thread Tomasz Kępski

Hi,

I would like to boost documents with longer descriptions, to push down
documents whose description has zero length.
I'm wondering whether it is possible to boost a document based on field
length at search time, or whether the only way is to store the field
length as an int in a separate field at indexing time?


Tom


Huge load and long response times during search

2009-11-20 Thread Tomasz Kępski

Hi,

I'm using Solr (1.4) to search among about 3,500,000 documents. After the
server kernel was updated to 64-bit, the system started to suffer.

Our server has 8G of RAM and two Intel Core 2 Duo CPUs.
We used to have average loads around 2-2.5. It was not as good as it
should be, but as long as HTTP response times were acceptable we did not
care too much ;-)


For the past few days the average load has usually been around 6, sometimes
going even to 20. The PHP, MySQL, and PostgreSQL based application is doing
rather fine, but when it tries to access Solr it takes ages to load a page.
In top, the Java process (Jetty) takes 200-250% of CPU, and iotop shows
that most of the disk operations are done by Solr threads as well.


When we shut down Jetty, the load goes down to 1.5 or even less than 1.

My index is ~12G; below is a part of my solrconfig.xml:

<query>
   <maxBooleanClauses>1024</maxBooleanClauses>
   <filterCache
     class="solr.LRUCache"
     size="16384"
     initialSize="4096"
     autowarmCount="4096"/>
   <queryResultCache
     class="solr.LRUCache"
     size="16384"
     initialSize="4096"
     autowarmCount="1024"/>
   <documentCache
     class="solr.LRUCache"
     size="16384"
     initialSize="16384"
     autowarmCount="0"/>
   <enableLazyFieldLoading>true</enableLazyFieldLoading>
   <useFilterForSortedQuery>true</useFilterForSortedQuery>
   <queryResultWindowSize>40</queryResultWindowSize>
   <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
   <HashDocSet maxSize="3000" loadFactor="0.75"/>
   <listener event="newSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
       <lst><str name="q">solr</str><str name="sort">price</str><str name="start">0</str><str name="rows">10</str></lst>
       <lst><str name="q">solr</str><str name="sort">rekomendacja</str><str name="start">0</str><str name="rows">10</str></lst>
       <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
     </arr>
   </listener>
   <listener event="firstSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst><str name="q">fast_warm</str><str name="start">0</str><str name="rows">10</str></lst>
       <lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
     </arr>
   </listener>
   <useColdSearcher>false</useColdSearcher>
</query>

 <requestHandler name="dismax" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">
       name^90.0 scategory^450.0 brand^90.0 text^0.01 description^30
     </str>
     <str name="pf">
     </str>
     <str name="bf">
     </str>
     <str name="fl">
       brand,description,id,name,price,score
     </str>
     <str name="mm">
       4&lt;100% 5&lt;90%
     </str>
     <int name="ps">100</int>
     <str name="q.alt">*:*</str>
   </lst>
 </requestHandler>

Sample query parameters from the log look like this:

2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true}
hits=3784 status=0 QTime=83

2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/spellCheckCompRH
params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true}
hits=3784 status=0 QTime=16


And finally, the question ;-)
How do I speed up the search?
Which parameters should I check first to find out where the bottleneck is?

Sorry for the verbose entry, but I wanted to give as clear a picture as
possible.


Thanks in advance,
Tom