Re: vector model usage

2010-06-01 Thread Rebecca Watson
Hi, if you want to store word+value pairs and have Lucene scoring weight the words by the values stored against them, you should look at using payloads and the DelimitedPayloadTokenFilter, which lets you specify e.g. word1|value1 word2|value2 ... so that the values are stored as payloads against the w
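
A minimal sketch of what that wiring could look like against the Lucene 2.9/3.0 API (the analyzer class name here is just illustrative):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.FloatEncoder;

    // Analyzer that splits on whitespace and stores the float after '|' as a payload.
    public class PayloadAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new WhitespaceTokenizer(reader);
            // "word1|0.5 word2|2.0" -> tokens "word1", "word2" with float payloads
            return new DelimitedPayloadTokenFilter(stream, '|', new FloatEncoder());
        }
    }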

Re: Problem fetching number of occurrences

2010-06-01 Thread Rebecca Watson
Hi, when you are indexing, use term vectors (org.apache.lucene.document.Field.TermVector): set this in the Field constructor when you create your Field objects at index time. I've never done it, but I'm pretty sure these can be retrieved at search time using one of the IndexReader.getTermFreqVec
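
Roughly, assuming the 2.9/3.0 API, that looks like the following (field and class names are just for illustration):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class TermVectorExample {

        // index time: ask Lucene to store term vectors for the field
        public static Document makeDoc(String text) {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.YES,
                              Field.Index.ANALYZED, Field.TermVector.YES));
            return doc;
        }

        // search time: pull the term frequencies back out for one document
        public static void printFreqs(IndexReader reader, int docId) throws IOException {
            TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " occurs " + freqs[i] + " times");
            }
        }
    }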

Re: how to extend Similarity in this situation?

2010-06-01 Thread Rebecca Watson
Hi Li Li, if you want to support some query types and not others, you should override/extend the QueryParser so that it throws an exception / makes a different query type instead. Similarity doesn't do the actual scoring; it's used by the Query classes (actually the Scorer implementation used by the
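
A rough sketch of that kind of parser subclass, assuming the Lucene 3.0 QueryParser (the class name and the choice of which query types to reject are just for illustration):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    // Parser that refuses fuzzy and wildcard queries instead of building them.
    public class RestrictedQueryParser extends QueryParser {

        public RestrictedQueryParser(String field, Analyzer analyzer) {
            super(Version.LUCENE_30, field, analyzer);
        }

        protected Query getFuzzyQuery(String field, String termStr, float minSimilarity)
                throws ParseException {
            throw new ParseException("fuzzy queries are not supported");
        }

        protected Query getWildcardQuery(String field, String termStr)
                throws ParseException {
            throw new ParseException("wildcard queries are not supported");
        }
    }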

Re: Problem fetching number of occurrences

2010-06-01 Thread Rebecca Watson
Hi, I was looking at another post which had this presentation in it - it has a nice section on TermFreqVectors: http://www.cnlp.org/presentations/slides/advancedluceneeu.pdf bec :) On 2 June 2010 13:56, Rebecca Watson wrote: > hi > > when you are indexing, use te

Re: about norm

2010-06-02 Thread Rebecca Watson
There are index-time boosts, i.e. calculated at index time, and search-time boosts. The field f always relates to the field(s) that the term t appears in. My understanding is that norm(t,d) includes the index-time boosts for each field, but I think t is only included in this calc in terms of field.ge
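
For reference, a small illustration of where the index-time boosts get set (per the Similarity javadoc, the norm then roughly folds together the document boost, the field boost and the length norm; field names here are just examples):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class BoostExample {

        public static Document boostedDoc(String titleText) {
            Document doc = new Document();
            doc.setBoost(2.0f);                    // document-level index-time boost
            Field title = new Field("title", titleText,
                    Field.Store.YES, Field.Index.ANALYZED);
            title.setBoost(3.0f);                  // field-level index-time boost
            doc.add(title);
            // both boosts (plus the length norm) end up encoded into norm(t,d)
            // for terms t appearing in the "title" field
            return doc;
        }
    }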

Re: retrieving Payload 3.0.1

2010-06-08 Thread Rebecca Watson
Hi Aad, see the search.payloads package if you want examples of reading in payloads at query time for scoring purposes, but returning the payload / using it to highlight will require you to write more custom Lucene classes. We work with synonyms too, but rather than store the synonym in payload lik
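
A rough sketch of the query-time side, assuming the 2.9/3.0 payloads API and a float payload written at index time (the class and method names here are mine):

    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.payloads.AveragePayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;

    // Similarity that turns a float payload (e.g. written by DelimitedPayloadTokenFilter
    // with a FloatEncoder) into a per-term score factor.
    public class PayloadSimilarity extends DefaultSimilarity {

        public float scorePayload(int docId, String fieldName, int start, int end,
                                  byte[] payload, int offset, int length) {
            if (payload == null || length == 0) {
                return 1.0f;
            }
            return PayloadHelper.decodeFloat(payload, offset);
        }

        // a payload-aware term query; also set this Similarity on the IndexSearcher
        public static Query payloadQuery(String field, String text) {
            return new PayloadTermQuery(new Term(field, text), new AveragePayloadFunction());
        }
    }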

Re: retrieving Payload 3.0.1

2010-06-08 Thread Rebecca Watson
E.g. 'institute' regardless of which term in the set matched. I.e. we know the first term in the same position was the original one, but leverage the built-in Highlighter for simplicity (esp. in Solr). bec :) Sent from my iPhone On 09/06/2010, at 10:32 AM, Rebecca Watson wrote: Hi Aad

Re: Problem using TopFieldCollector

2010-06-11 Thread Rebecca Watson
Hi, I had similar issues migrating to using the new collectors... we use a custom HitCollector too, where we accessed document fields to aid in scoring docs. When migrating, I chose to extend the Collector class, where the collect method is still extended pretty much as before in the new abstract met
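
A skeleton of that migration, assuming the 2.9/3.0 Collector API (the class name and the per-hit logic are placeholders):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Skeleton of an old HitCollector ported to the 2.9+ Collector API.
    public class MyCollector extends Collector {
        private Scorer scorer;
        private IndexReader currentReader;
        private int docBase;    // offset of the current segment in the whole index

        public void setScorer(Scorer scorer) {
            this.scorer = scorer;
        }

        public void setNextReader(IndexReader reader, int docBase) {
            this.currentReader = reader;
            this.docBase = docBase;
        }

        public void collect(int doc) throws IOException {
            float score = scorer.score();
            // doc is segment-relative; docBase + doc is the index-wide docid,
            // and currentReader.document(doc) fetches fields for custom scoring
            // ... per-hit logic goes here ...
        }

        public boolean acceptsDocsOutOfOrder() {
            return true;
        }
    }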

Re: Problem using TopFieldCollector

2010-06-11 Thread Rebecca Watson
discusses global docid vs. docid for the current index in the example: http://lucene.apache.org/java/2_9_0/api/all/index.html bec :) On 12 June 2010 10:52, Rebecca Watson wrote: > hi, > > i had similar issues migrating to using the new collectors... we use a custom > hitcollector too where

Re: Stop words filter

2010-06-22 Thread Rebecca Watson
I guess you are using Lucene 2.9 or below if you're talking about Tokens still... here's some old code I used to use (not sure if I wrote it or grabbed it from online examples - it's been a while since I used it!) that grabbed the set of tokens given a field name + text to analyse (for any class that
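
Something along those lines, sketched against the old 2.x Token API (class and method names are illustrative; the attribute-based API replaces this in 3.x):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Run an analyzer over some text and collect the token terms it produces.
    public class TokenDumper {
        public static List<String> getTokens(Analyzer analyzer, String field, String text)
                throws IOException {
            List<String> terms = new ArrayList<String>();
            TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
            for (Token tok = stream.next(); tok != null; tok = stream.next()) {
                terms.add(tok.term());
            }
            stream.close();
            return terms;
        }
    }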

Re: Index with multiple level structure

2010-06-25 Thread Rebecca Watson
Hi Alex, sounds like you are going to tackle a similar problem to what we're trying to do in our XML too -- as it looks like you've got a one-to-many type relationship you want to search over but return results based on the top-level document -- similar to an XML, i.e. structured doc, search problem --

Re: How to manage resource out of index?

2010-07-06 Thread Rebecca Watson
Hi Li, I looked at doing something similar - where we only index the text but retrieve search results / highlight from files -- we ended up giving up because of the amount of customisation required in Solr -- mainly because we wanted the distributed search functionality in Solr, which meant making

Re: Why not normalization?

2010-07-07 Thread Rebecca Watson
Hi, > 1) Although Lucene uses tf to calculate scoring, it seems to me that term > frequency has not been normalized. Even if I index several documents, it > does not normalize the tf value. Therefore, since the total number of words > in indexed documents varies, can't there be a fault in Lucene's sc
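
For what it's worth, the default scoring does fold document length in via the length norm computed at index time. Sketched roughly, the tf and length-norm pieces of DefaultSimilarity behave like this (shown here only as an override point, not a change in behaviour):

    import org.apache.lucene.search.DefaultSimilarity;

    // Raw tf is dampened with a square root, and the per-field length norm
    // divides out document length. Override these to change the normalisation.
    public class MySimilarity extends DefaultSimilarity {

        public float tf(float freq) {
            return (float) Math.sqrt(freq);              // same as the default
        }

        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));  // same as the default
        }
    }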