Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-15 Thread Michael Wechner
On 15.01.23 at 16:36, Michael Sokolov wrote: I would suggest building Lucene from source and adding your own similarity function to VectorSimilarity. That is the proper extension point for similarity functions. If you find there is some substantial benefit, it wouldn't be a big lift t

Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-15 Thread Michael Sokolov
I would suggest building Lucene from source and adding your own similarity function to VectorSimilarity. That is the proper extension point for similarity functions. If you find there is some substantial benefit, it wouldn't be a big lift to add something like that. However I'm dubious

Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-14 Thread Michael Wechner
vectors format that ignores the vector similarity configured on the field and uses its own. On Sat, Jan 14, 2023, 21:33, Michael Wechner wrote: Hi IIUC Lucene currently supports VectorSimilarityFunction.COSINE, VectorSimilarityFunction.DOT_PRODUCT, VectorSimilarityFunction.EUCLIDEAN whereas s

Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-14 Thread Adrien Grand
Hi Michael, You could create a custom KNN vectors format that ignores the vector similarity configured on the field and uses its own. On Sat, Jan 14, 2023, 21:33, Michael Wechner wrote: > Hi > > IIUC Lucene currently supports > > VectorSimilarity

Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-14 Thread Michael Wechner
Hi IIUC Lucene currently supports VectorSimilarityFunction.COSINE, VectorSimilarityFunction.DOT_PRODUCT, VectorSimilarityFunction.EUCLIDEAN whereas some embedding models have been trained with other metrics. Also see https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdi

Re: The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Michael Wechner
o ask whether the current default similarity implementation of Lucene is really BM25, right? as described at https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/ Thanks Michael ---

Re: The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Michael Wechner
don't write anything wrong I would like to ask whether the current default similarity implementation of Lucene is really BM25, right? as described at https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/ Thanks Mi

Re: The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Adrien Grand
> would like to ask > > whether the current default similarity implementation of Lucene is > really BM25, right? > > as described at > > https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-

The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Michael Wechner
Hi On the Lucene FAQ there is no mention of TF-IDF or BM25 and I would like to add some notes, but to be sure I don't write anything wrong I would like to ask whether the current default similarity implementation of Lucene is really BM25, right? as described at

Re: Fuzzy Query Similarity

2022-07-12 Thread Mike Drob
have enough information yet to say if this is expected in the application > or not, but it explains how we get the scores so there's something > satisfying about at least that bit. > > As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that > computed similarity to

Re: Fuzzy Query Similarity

2022-07-11 Thread Mike Drob
e" is 80% similar because it's a 5 character term with a single edit (1/5). I don't have enough information yet to say if this is expected in the application or not, but it explains how we get the scores so there's something satisfying about at least that bit. As a hacky id
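The 80% figure in the message above can be reproduced with a small sketch. This is only an illustration, assuming the similarity is 1 minus the number of edits divided by the term length, as the message describes; the class and method names are hypothetical and are not part of the Lucene API.

```java
public class FuzzySimilaritySketch {

    /**
     * Hypothetical helper (not Lucene API): similarity of a fuzzy match,
     * assumed to be 1 - edits / termLength as described in the thread.
     */
    public static float similarity(int edits, int termLength) {
        if (termLength <= 0) {
            throw new IllegalArgumentException("termLength must be positive");
        }
        return 1.0f - (float) edits / (float) termLength;
    }

    public static void main(String[] args) {
        // A 5-character term with a single edit, as in the thread.
        System.out.println(similarity(1, 5));
        // An exact match (zero edits) keeps the full weight.
        System.out.println(similarity(0, 5));
    }
}
```

Under this assumption an exact match always scores higher than any fuzzy match of the same term, which is the behaviour the thread is trying to confirm.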

Re: Fuzzy Query Similarity

2022-07-09 Thread Michael Sokolov
Oh good! Thanks for clarifying, Uwe On Sat, Jul 9, 2022, 12:23 PM Uwe Schindler wrote: > Hi > > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact > > matches, or even to incorporate the edit distance more generally into > > the per-term score, although it does seem like that wou

Re: Fuzzy Query Similarity

2022-07-09 Thread Uwe Schindler
Hi FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact matches, or even to incorporate the edit distance more generally into the per-term score, although it does seem like that would be something people would generally expect. Actually it does this: * By default FuzzyQuery uses

Re: Fuzzy Query Similarity

2022-07-09 Thread Uwe Schindler
y; if you get a term "foo", you could maybe search for "foo OR foo~" ? On Fri, Jul 8, 2022 at 4:14 PM Mike Drob wrote: Hi folks, I'm working with some fuzzy queries and trying my best to understand what is the expected behaviour of the searcher. I'm not sure if this i

Re: Fuzzy Query Similarity

2022-07-09 Thread Michael Sokolov
> > I'm working with some fuzzy queries and trying my best to understand what > is the expected behaviour of the searcher. I'm not sure if this is a > similarity bug or an incorrect usage on my end. > > The problem is when I do a fuzzy search for a term "spark~" then

Fuzzy Query Similarity

2022-07-08 Thread Mike Drob
Hi folks, I'm working with some fuzzy queries and trying my best to understand what is the expected behaviour of the searcher. I'm not sure if this is a similarity bug or an incorrect usage on my end. The problem is when I do a fuzzy search for a term "spark~" then instead o

Re: Providing weights for individual terms in a query based on similarity to document terms

2020-07-03 Thread Ali Akhtar
I think what I'm looking for is to multiply the term frequency of each term by the similarity score. E.g. for 'shoes', it's an exact match, so tf * 1. For 'socks', similarity = 0.8 -> tf * 0.8. For 'clothes', similarity = 0.65 -> tf * 0.65. Is there a way to ach
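The arithmetic proposed in this message can be written out as a short sketch. It is plain Java, not Lucene code: the class name, the method name and the raw term frequency of 2 are hypothetical values chosen for illustration, while the similarity scores (1, 0.8, 0.65) are the ones quoted in the message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedTermFrequencySketch {

    /** Hypothetical helper: scales a raw term frequency by a similarity score. */
    public static double weightedTf(int tf, double similarity) {
        return tf * similarity;
    }

    public static void main(String[] args) {
        // Similarity of each document term to the query term 'shoes' (from the thread).
        Map<String, Double> similarityToShoes = new LinkedHashMap<>();
        similarityToShoes.put("shoes", 1.0);
        similarityToShoes.put("socks", 0.8);
        similarityToShoes.put("clothes", 0.65);

        int rawTf = 2; // assumed raw term frequency, for illustration only
        for (Map.Entry<String, Double> e : similarityToShoes.entrySet()) {
            System.out.println(e.getKey() + " -> " + weightedTf(rawTf, e.getValue()));
        }
    }
}
```

How such a weighted frequency would be fed into Lucene's scoring is the open question of the thread and is not addressed by this sketch.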

Providing weights for individual terms in a query based on similarity to document terms

2020-07-03 Thread Ali Akhtar
Hellooo, Suppose a user enters ‘box of shoes’ in my search box. I have two documents titled ‘box of clothes’ and ‘box of socks’. I’ve figured out through a separate algorithm that ‘socks’ is more similar to ‘shoes’ than clothes. I even have a numeric score for the similarity: for socks it’s 0.8

Re: Payload TFIDF Similarity in Lucene 7.1.0

2018-03-14 Thread Michael Sokolov
ing into > payloads) > > in there, in place of the term frequency. > > > > On Mar 13, 2018 6:57 AM, "Erik Hatcher" wrote: > > > > > Payloads are only scored from certain query types. What query are you > > > executing? > > > > >

Re: Payload TFIDF Similarity in Lucene 7.1.0

2018-03-13 Thread Erdan Genc
e term frequency. > > On Mar 13, 2018 6:57 AM, "Erik Hatcher" wrote: > > > Payloads are only scored from certain query types. What query are you > > executing? > > > > > On Mar 13, 2018, at 04:58, Grdan Eenc > wrote: > > > > >

Re: Payload TFIDF Similarity in Lucene 7.1.0

2018-03-13 Thread Michael Sokolov
types. What query are you > executing? > > > On Mar 13, 2018, at 04:58, Grdan Eenc wrote: > > > > Hej there, > > > > I want to extend the TFIDF Similarity class such that the term frequency > is > > neglected and the value in the payload used instead.

Re: Payload TFIDF Similarity in Lucene 7.1.0

2018-03-13 Thread Erik Hatcher
Payloads are only scored from certain query types. What query are you executing? > On Mar 13, 2018, at 04:58, Grdan Eenc wrote: > > Hej there, > > I want to extend the TFIDF Similarity class such that the term frequency is > neglected and the value in the payload used in

Payload TFIDF Similarity in Lucene 7.1.0

2018-03-13 Thread Grdan Eenc
Hej there, I want to extend the TFIDF Similarity class such that the term frequency is neglected and the value in the payload used instead. Therefore I basically do this: @Override public float tf(float freq) { return 1f; } public float scorePayload(int doc, int start

Re: Custom Similarity

2018-02-08 Thread Erick Erickson
order to activate payloads during scoring, you need to do two separate > things at the same time: > * use a payload aware query type: org.apache.lucene.queries.payloads.* > * use payload aware similarity > > Here is an old post that might inspire you : > https://lucidworks

Re: Custom Similarity

2018-02-08 Thread Ahmet Arslan
Hi Roy, In order to activate payloads during scoring, you need to do two separate things at the same time: * use a payload aware query type: org.apache.lucene.queries.payloads.* * use payload aware similarity Here is an old post that might inspire you :  https://lucidworks.com/2009/08/05

Re: Custom Similarity

2018-01-27 Thread Dwaipayan Roy
Thanks for your replies. But still, I am not sure about the way to do the thing. Can you please provide me with an example code snippet or, link to some page where I can find one? Thanks.. On Tue, Jan 16, 2018 at 3:28 PM, Dwaipayan Roy wrote: > I want to make a scoring function that will score

Re: Custom Similarity

2018-01-16 Thread Adrien Grand
If you are working with payloads, you will also want to have a look at PayloadScoreQuery. On Tue, Jan 16, 2018 at 12:26, Michael Sokolov wrote: > Have a look at Expressions class. It compiles JavaScript that can reference > other values and can be used for ranking. > > On Jan 16, 2018 4:58 AM

Re: Custom Similarity

2018-01-16 Thread Michael Sokolov
Have a look at Expressions class. It compiles JavaScript that can reference other values and can be used for ranking. On Jan 16, 2018 4:58 AM, "Dwaipayan Roy" wrote: > I want to make a scoring function that will score the documents by the > following function: > given Q = {q1, q2, ... } > score

Custom Similarity

2018-01-16 Thread Dwaipayan Roy
I want to make a scoring function that will score the documents by the following function: given Q = {q1, q2, ... } score(D,Q) = for all qi: SUM of { LOG { weight_1(qi) + weight_2(qi) + weight_3(qi) } } I have stored weight_1, weight_2 and weight_3 for all terms of all docu
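The scoring function described in this message can be sketched in plain Java to make the arithmetic concrete. This is not a Lucene Similarity implementation: the class and method names are hypothetical, the natural logarithm is an assumption (the message does not state a base), and how the three weights are read from the index is left out.

```java
public class LogSumScoreSketch {

    /**
     * Hypothetical helper: score(D, Q) = sum over query terms qi of
     * log(weight_1(qi) + weight_2(qi) + weight_3(qi)).
     * Each row holds the stored weights of one query term for the document.
     * The natural logarithm is assumed here.
     */
    public static double score(double[][] weightsPerQueryTerm) {
        double total = 0.0;
        for (double[] weights : weightsPerQueryTerm) {
            double sum = 0.0;
            for (double w : weights) {
                sum += w;
            }
            total += Math.log(sum);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two query terms with three made-up weights each.
        double[][] weights = {
            {1.0, 1.0, 1.0},
            {0.5, 0.25, 0.25}
        };
        System.out.println(score(weights));
    }
}
```

Note that a term whose weights sum to less than 1 contributes a negative value, and a sum of 0 yields negative infinity, so a real implementation would need to decide how to treat missing terms.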

Re: Altering Term Frequency in Similarity

2016-12-15 Thread Robert Muir
Maybe have a look at SynonymQuery: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/SynonymQuery.java I think it does a similar thing to what you want, it sums up the frequencies of the synonyms and passes that sum to the similarity class as TF. On

Altering Term Frequency in Similarity

2016-12-14 Thread Mossaab Bagdouri
s given to the similarity class by score(int doc, float freq). Which class does provide that freq? Or what can I change to provide a different freq value, practically changing the document representation (e.g., freq[0] = freq[0] + freq[1]; freq[1] = 0); Regards, Mossaab

Setting LMJelinekMercer Similarity in Luke

2016-07-20 Thread Dwaipayan Roy
Hello. I want to set LMJelinekMercer Similarity (with lambda set to, say, 0.6) for the Luke similarity calculation. Luke by default uses the DefaultSimilarity. Can anyone help with this? I use Lucene 4.10.4 and Luke for that version of Lucene index. Dwaipayan..

Re: Similarity Implementation

2016-07-07 Thread Đạt Cao Mạnh
Hi Siraj, I think https://lucene.apache.org/core/6_1_0/core/index.html?org/apache/lucene/search/ConstantScoreQuery.html should be good enough. On Fri, Jul 8, 2016 at 12:27 AM Siraj Haider wrote: > We are in the process of upgrading from 2.x to 6.x. In 2.x we implemented > our own simi

Similarity Implementation

2016-07-07 Thread Siraj Haider
We are in the process of upgrading from 2.x to 6.x. In 2.x we implemented our own similarity where all the functions return 1.0f, how can we implement such thing in 6.x? Is there an implementation already there that we can use and have the same results? -- Regards -Siraj Haider (212) 306

Re: Simple Similarity Implementation to Count the Number of Hits

2016-05-12 Thread Ahmet Arslan
Hi Luis, That's an interesting question. Can you share your similarity? I suspect you return 1 except in the Similarity#coord method. Not sure but, for phrase query, one may require to modify ExactPhraseScorer/ExactPhraseScorer etc. ahmet On Thursday, May 12, 2016 5:41 AM, Luís Filipe Nassif wrote

Simple Similarity Implementation to Count the Number of Hits

2016-05-11 Thread Luís Filipe Nassif
Hi, In the past (lucene 4) I have tried to implement a simple Similarity to only count the number of occurrences (term frequencies) into the documents, ignoring norms, doc frequencies, boosts... It worked for some queries like term and wildcard queries, but not for others, like phrase and range

Re: Access query length inside similarity

2015-11-03 Thread Ahmet Arslan
Hi, I only use BooleanQuery with TermQuery clauses. I found the following methods that seem relevant to my need. There is a variable named maxOverlap, which is the total number of terms in the query. BooleanScorer's constructor has maxCoord variable Similarity#coord BooleanWeight#coord Ho

Access query length inside similarity

2015-10-27 Thread Ahmet Arslan
Hi, How can I access length of the query (number of words in the query) inside a SimilarityBase implementation? P.S. I am implementing multi-aspect TF [1] for an experimental study. So it does not have to be fast/optimized as production code. [1] http://dl.acm.org/citation.cfm?doid=2484028.2484

similarity per query

2015-10-08 Thread Sheng
Let's say I have a boolean query "a AND b", is it possible I run the search for this boolean query with similarity "Sa" set for query "a", and similarity "Sb" set for query "b" ?

Getting cosine similarity of any given two Lucene 5.1 Documents using latest APIs

2015-07-11 Thread Nitish Nitish
Hi All, Greetings, Just started with Lucene 5.1 a month ago for my research. I have a set of documents indexed with term frequencies option enabled during indexing. Given any two documents, I would like to calculate their tfidf cosine similarity. Could you please point me to the right

access query term in similarity calculation

2015-05-23 Thread Ahmet Arslan
Hi, I have a number of similarity implementations that extend SimilarityBase. I need to learn which term I am scoring inside the method : abstract float score(BasicStats stats, float freq, float docLen); What is the easiest way to access the query term that I am scoring in similarity class

Computing the similarity of documents

2015-05-21 Thread Fotis P
Hello everyone, My task at hand is to compute the pairwise cosine similarity between a list of documents. I first index all the documents with DOCS_AND_FREQS option, then I construct a query from every term of a document: Query query = parser.parse(document); making sure to use the same

Re: for check similarity of two sentences

2015-04-02 Thread Robust Links
sh jay wrote: > > > hi, > > I am a second-year undergraduate at the University of Moratuwa, Sri Lanka. For my > second > > year project I am building a question answering system (knowledge base). In this > > project I have to suggest similar questions previously asked by other > users.

Re: for check similarity of two sentences

2015-04-02 Thread Gimantha Bandara
I am a second-year undergraduate at the University of Moratuwa, Sri Lanka. For my second > year project I am building a question answering system (knowledge base). In this > project I have to suggest similar questions previously asked by other users. > I should find similarity of two sentences in my appli

for check similarity of two sentences

2015-03-31 Thread hesh jay
hi, I am a second-year undergraduate at the University of Moratuwa, Sri Lanka. For my second-year project I am building a question answering system (knowledge base). In this project I have to suggest similar questions previously asked by other users. I should find similarity of two sentences in my application to

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-19 Thread danield
Update: I have implemented my own subclasses of QueryParser, BooleanQuery, BooleanScorer and Similarity to deal with this. I have been successful in getting the exact behaviour I want... when calling the .explain() method. However, the scores for some documents often differ when calling

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Jack Krupansky
at documentation is wrong! Any ideas > how > to fix? > Daniel > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-ma

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield
Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that explanation more prominent, as I clearly missed it. Never mind, I am working on my own solution for this, through subclassing QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other classes. C

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Michael Sokolov
On 1/15/15 11:23 AM, danield wrote: Hi Mike, Thank you for your reply. Yes, I had thought of this, but it is not a solution to my problem, and this is because the Term Frequency and therefore the results will still be wrong, as prepending or appending a string to the term will still make it a di

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield
different classes differently will lead to increased relevance of results. This also doesn't change the fact that documentation is wrong! Any ideas how to fix? Daniel -- View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-ho

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-14 Thread Michael Sokolov
"field:field1\:term1" query2="field:(field1\:term1 or field2\:term1)" -Mike On 1/13/15 2:24 PM, danield wrote: Hi all, I have found, much to my dismay, that the documentation on Lucene’s default similarity formula is very dangerously misleading. See it here: http://lucene.apac

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-13 Thread danield
Corrections: document2={field1:”term1”, field2:”term1”} Coord(query1,document2)= 1/1 = 1 (Doesn't affect the problem/observation) -- View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-qu

Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-13 Thread danield
Hi all, I have found, much to my dismay, that the documentation on Lucene’s default similarity formula is very dangerously misleading. See it here: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf Term Frequency (TF) counts are

Re: Payload and Similarity Function: Always same value

2014-10-30 Thread Erick Erickson
Payload feature to add meta information to tokens. I specifically add >>> weights (i.e. 0-100) to conceptual tags in order to use them to >>> overwrite the standard Lucene TF-IDF weighting. I am puzzled by the >>> behaviour of this and I believe there is something wrong w

Re: Payload and Similarity Function: Always same value

2014-10-30 Thread Ralf Bierig
o use them to overwrite the standard Lucene TF-IDF weighting. I am puzzled by the behaviour of this and I believe there is something wrong with the Similarity class, that I overwrote, but I cannot figure it out. I attach the complete code below for this example. When I run a query with it (e.g. "

Re: Payload and Similarity Function: Always same value

2014-10-30 Thread Ralf Bierig
o use them to overwrite the standard Lucene TF-IDF weighting. I am puzzled by the behaviour of this and I believe there is something wrong with the Similarity class, that I overwrote, but I cannot figure it out. I attach the complete code below for this example. When I run a query with it (e.g. "

Re: Payload and Similarity Function: Always same value

2014-10-30 Thread Michael Sokolov
g with the Similarity class, that I overwrote, but I cannot figure it out. I attach the complete code below for this example. When I run a query with it (e.g. "concept:red") I discover that each payload is always the first number that was passed through MyPayloadSimilarity (in the c

Payload and Similarity Function: Always same value

2014-10-30 Thread Ralf Bierig
believe there is something wrong with the Similarity class, that I overwrote, but I cannot figure it out. I attach the complete code below for this example. When I run a query with it (e.g. "concept:red") I discover that each payload is always the first number that was pass

How to use query terms tfidf as a factor in document similarity calculation

2014-05-18 Thread Diaa Abdallah
Hi, I'm trying to implement Explicit semantic analysis(ESA) via Lucene. How do I take a term TFIDF in a query into consideration when matching documents? For example: Query:"a b c a d a" Doc1:"a b a" Doc2:"a b c" The query should match Doc1 better than 2. I'd like this to work without impacting

Re: tf/idf similarity with modified document similarity

2014-03-07 Thread Jack Krupansky
@lucene.apache.org Subject: tf/idf similarity with modified document similarity Hello, what is the best method to score documents similar to default similarity, but the document frequency should be calculated per query against the matching result document set

tf/idf similarity with modified document similarity

2014-03-06 Thread Christian Reuschling
Hello, what is the best method to score documents similar to default similarity, but the document frequency should be calculated per query against the matching result document set, not statically against the whole corpus. Didn't find a goo

Re: Changing similarity at query time

2013-12-09 Thread Ivan Brusic
To answer my own question, it appears that despite the warning, using a custom similarity only at search time appears to be working. The score() method was the wrong code to override, I simply hardcoded the return value of decodeNormValue to 1.0. Since this value is used for normalization, as long

Changing similarity at query time

2013-12-09 Thread Ivan Brusic
I am currently using document-level boosts, which really translates to changing the norm for every field under the covers. As part of an experiment, I want to remove the boost, but that would require either re-indexing content or changing the scoring algorithm (similarity). If I create my own

Similarity calc for NumericRangeQuery

2013-11-13 Thread Goutham Tholpadi
When I include a BooleanClause with a NumericRangeQuery, the results for MUST are different from those for SHOULD (as expected). My question is: In the case of SHOULD, is the NumericRangeQuery effectively ignored? Is there a similarity calculation based on how far the document's field val

RE: Lucene Text Similarity

2013-09-04 Thread Allison, Timothy B.
l.com] Sent: Wednesday, September 04, 2013 1:45 PM To: java-user@lucene.apache.org Subject: Re: Lucene Text Similarity Thanks to all, I will take into account your suggestions. But I think that should have given the concrete use case. Therefore, taking into account my first example given, I have the

Re: Lucene Text Similarity

2013-09-04 Thread David Miranda
st, > >Tim > > > From: Ivan Krišto [ivan.kri...@gmail.com] > Sent: Wednesday, September 04, 2013 3:17 AM > To: java-user@lucene.apache.org > Subject: Re: Lucene Text Similarity > > On 09/03/2013 07:33 PM, David Miranda wrote: > > Is there any wa

RE: Lucene Text Similarity

2013-09-04 Thread Allison, Timothy B.
From: Ivan Krišto [ivan.kri...@gmail.com] Sent: Wednesday, September 04, 2013 3:17 AM To: java-user@lucene.apache.org Subject: Re: Lucene Text Similarity On 09/03/2013 07:33 PM, David Miranda wrote: Is there any way to check the similarity of texts with Lucene? I have the

Re: Lucene Text Similarity

2013-09-04 Thread Ivan Krišto
On 09/03/2013 07:33 PM, David Miranda wrote: Is there any way to check the similarity of texts with Lucene? I have the DBpedia indexed and wanted to get the texts more similar between the abstract and DBpedia another text. If I do a search in the abstract field, with a particular text the result

Re: Lucene Text Similarity

2013-09-03 Thread Koji Sekiguchi
(13/09/04 2:33), David Miranda wrote: Is there any way to check the similarity of texts with Lucene? I have the DBpedia indexed and wanted to get the texts more similar between the abstract and DBpedia another text. If I do a search in the abstract field, with a particular text the result is

Lucene Text Similarity

2013-09-03 Thread David Miranda
Is there any way to check the similarity of texts with Lucene? I have the DBpedia indexed and wanted to get the texts more similar between the abstract and DBpedia another text. If I do a search in the abstract field, with a particular text the result is not very satisfactory. Eg Abstract

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Shai Erera
ok that makes sense. Shai On Mon, Aug 12, 2013 at 9:18 PM, Robert Muir wrote: > On Mon, Aug 12, 2013 at 11:06 AM, Shai Erera wrote: > > > > Or, you'd like to keep FieldCache API for sort of back-compat with > existing > > features, and let the app control the "caching" by using an explicit >

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 11:06 AM, Shai Erera wrote: > > Or, you'd like to keep FieldCache API for sort of back-compat with existing > features, and let the app control the "caching" by using an explicit > RamDVFormat? > Yes. In the future ideally fieldcache goes away and is a UninvertingFilterRea

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Shai Erera
t? > > > > Yes, exactly. its a little confusing, but a tradeoff to make docvalues > > work transparently with lots of existing code built off of fieldcache > > (sorting/grouping/joins/faceting/...) without having to have 2 > > separate implementations of what i

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Ross Woolf
ing/grouping/joins/faceting/...) without having to have 2 > separate implementations of what is the same thing. so its like > "docvalues is a fieldcache you already built at index-time". > > > > > Also, my similarity was extending SimilarityBase, and I can't see how

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
es is a fieldcache you already built at index-time". > > Also, my similarity was extending SimilarityBase, and I can't see how to > get a docId as it is not passed in the score method "score(BasicStats > stats, float freq, float docLen)". Will I need to extend using Simila

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Ross Woolf
Okay, just for clarity's sake, what you are saying is that if I make the FieldCache call it won't actually create and impose the loading time of the FieldCache, but rather just use the NumericDocValuesField instead. Is this correct? Also, my similarity was extending SimilarityBase, and I can'

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
ng.java >> >> On Mon, Aug 12, 2013 at 10:43 AM, Ross Woolf wrote: >> > The JavaDocs for NumericDocValuesField indicates that this field value >> can >> > be used for scoring. The example shows how to store the field, but I am >> > unclear as to how to retrie

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Ross Woolf
lf wrote: > > The JavaDocs for NumericDocValuesField indicates that this field value > can > > be used for scoring. The example shows how to store the field, but I am > > unclear as to how to retrieve the value of the field while in a > similarity > > to use it when scoring

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
eld indicates that this field value can > be used for scoring. The example shows how to store the field, but I am > unclear as to how to retrieve the value of the field while in a similarity > to use it when scoring a document? Can someone point me to an example or > give me one that d

How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Ross Woolf
The JavaDocs for NumericDocValuesField indicates that this field value can be used for scoring. The example shows how to store the field, but I am unclear as to how to retrieve the value of the field while in a similarity to use it when scoring a document? Can someone point me to an example or

Re: raw cosine similarity

2013-07-21 Thread lukai
It's not hard to implement one. Store the term values of your document in payloads. Then create your own Query and override the score function with your cosine similarity logic. The problem here is that you need to watch out for performance, especially for terms that have a very high DF. It may dec
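The cosine logic that this reply suggests plugging into a custom score function can be sketched independently of Lucene. The example below is a minimal illustration that assumes the query and the document are already available as sparse term-to-weight maps; reading those weights from payloads or term vectors is not shown, and the class and method names are hypothetical.

```java
import java.util.Map;

public class RawCosineSketch {

    /**
     * Hypothetical helper: raw cosine similarity between two sparse
     * term-weight vectors. Returns 0 if either vector has zero length.
     */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double va = e.getValue();
            normA += va * va;
            Double vb = b.get(e.getKey());
            if (vb != null) {
                dot += va * vb; // only shared terms contribute to the dot product
            }
        }
        for (double vb : b.values()) {
            normB += vb * vb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up weights for a query and a document (requires Java 9+ for Map.of).
        Map<String, Double> query = Map.of("lucene", 1.0, "cosine", 1.0);
        Map<String, Double> doc = Map.of("lucene", 2.0, "scoring", 1.0);
        System.out.println(cosine(query, doc));
    }
}
```

Iterating over only one map for the dot product keeps the cost proportional to the shorter side when the query map is passed first, which matters for the performance concern raised in the reply.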

raw cosine similarity

2013-07-21 Thread Malgorzata Urbanska
Hi, I would like to calculate raw cosine similarity between query and document. I read the documentation about Lucene scoring but I'm still confused. Is there any implementation in Lucene 4.3.0 that does that? If not, what is the easiest way to do this? So far I'm retrieving a TermVector fo

Cosine Similarity Using Two or More Terms

2013-03-07 Thread Peter Lavin
Dear Users, I'm calculating cosine similarity between two documents using code based on the code at this link... http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html It is working fine, but I want to use terms from two different fields in my indexed docu

Re: Getting a similarity score for an arbitrary pair of documents or a query and a document

2013-03-06 Thread Emmanuel Espina
Have you already checked Solr's more like this? http://wiki.apache.org/solr/MoreLikeThisHandler and http://wiki.apache.org/solr/MoreLikeThis You describe a problem similar to the use case of that component, and if there is something to hack, it is Solr's more like this. Lucene's simi

Getting a similarity score for an arbitrary pair of documents or a query and a document

2013-03-06 Thread Michael O'Leary
Is there an api in Lucene for finding the similarity score for two documents that have been randomly pulled from an index? What about for a query and a randomly selected document? I realize this isn't the standard purpose of Lucene, but I was given a task to compare similarity scores fo

Re: Setting Similarity classes in Benchmark .alg scripts

2013-02-06 Thread Robert Muir
the performance (speed) ? In either case... patches welcome! On Mon, Feb 4, 2013 at 6:01 PM, Michael O'Leary wrote: > I'd like to compare the relevance scores that are returned when using the > Similarity classes that are available in Lucene 4.x, and it seems like > using the

Setting Similarity classes in Benchmark .alg scripts

2013-02-04 Thread Michael O'Leary
I'd like to compare the relevance scores that are returned when using the Similarity classes that are available in Lucene 4.x, and it seems like using the Benchmark component would be a good way to do that. It looks like there isn't currently a way to specify a Similarity class to use in

Re: Superset Similarity?

2012-11-16 Thread Robert Muir
On Fri, Nov 16, 2012 at 5:18 PM, Tom Burton-West wrote: > Hi Otis, > > I hope this is not off-topic, > > Apparently in Lucene similarity does not have to be set at index time: > Actually in the general case it does. IndexWriter calls the Similarity's computeNorm method

Re: Superset Similarity?

2012-11-16 Thread Tom Burton-West
Hi Otis, I hope this is not off-topic, Apparently in Lucene similarity does not have to be set at index time: See http://lucene.apache.org/core/4_0_0/changes/Changes.html under Lucene 2959 "All models default to the same index-time norm encoding as DefaultSimilarity, so you can easily try

Re: setting different similarity in config (.alg) file at indexing

2012-09-28 Thread Sachin Kulkarni
your index to the new format with IndexUpgrader first." So basically in my case I do not need to set it in the .alg file. On Wed, Sep 5, 2012 at 7:58 AM, Sachin Kulkarni wrote: > Hi, > > For Lucene core 4.0. BETA, under the search.similarities help page it says > the followin

setting different similarity in config (.alg) file at indexing

2012-09-05 Thread Sachin Kulkarni
Hi, For Lucene core 4.0. BETA, under the search.similarities help page it says the following "To change Similarity<http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html>, one must do so for both indexing and searching, and the changes

Re: Document Similarity

2012-07-30 Thread in.abdul
>> regardsshaimaa >> >> -- >> If you reply to this email, your message will be added to the >> discussion below: >> http://lucene.472066.n3.nabble.com/Document-Similarity-tp3998082.html >> To unsubscribe from Lucene, click >&g

RE: Document Similarity

2012-07-30 Thread Elshaimaa Ali
Thank you so much for the prompt reply. I need to extract a document from the index that is similar to an HTML document, and I need to use cosine similarity or latent semantic analysis, which means that I need to generate a term vector for the HTML document. The link you sent me doesn't co

Re: Document Similarity

2012-07-30 Thread in.abdul
can use to map the document to one of the documents in > the index > regardsshaimaa > > -- > If you reply to this email, your message will be added to the discussion > below: > http://lucene.472066.n3.nabble.com/Document-Similarity-tp3998082.html

Different Weights to Lucene fields with Okapi Similarity

2012-07-16 Thread Kasun Perera
Resending again, since my question didn't get much attention -- Forwarded message -- From: Kasun Perera Date: Tue, Jun 19, 2012 at 3:26 PM Subject: Different Weights to Lucene fields with Okapi Similarity To: java-user@lucene.apache.org Based on this link http://www200

Different Weights to Lucene fields with Okapi Similarity

2012-06-19 Thread Kasun Perera
Based on this link http://www2002.org/CDROM/refereed/643/node6.html , I'm calculating Okapi similarity between the query document and another document as below using Lucene: I have indexed the documents using 3 fields. I want to give higher weight to field 2 and field 3. I can't us

Re: Grouping Based on Multiple Fields Similarity

2012-05-21 Thread Robby
Hi All, Sorry, I gave the wrong example; it should actually be like this. On Mon, May 21, 2012 at 9:31 PM, Robby wrote: > - Grouping 1, count : 3 > - row id = 1 > - row id = 23 > - row id = 100 > - Grouping 2 > - row id = 11 > - row id = 133 > - ... > Regards, Ro

Grouping Based on Multiple Fields Similarity

2012-05-21 Thread Robby
tering based on similarity between four of five fields. The end result would be something like this : - Grouping 1, count : 3 - row id = 1 - row id = 23 - row id = 100 - Grouping 2 - row id = 1 - row id = 23 - ... I have done some research and MoreLikeThis class can

Re: Better Way of calculating Cosine Similarity between documents

2012-05-18 Thread nemeskey . david
and their term frequencies by reading the index and calculate TF-IDF scores vector for each document. Then using TF-IDF vectors, I calculate pairwise cosine similarity between documents using the equation here http://en.wikipedia.org/wiki/Cosine_similarity. This is my problem Say I have two identi
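
For readers skimming this thread, the approach described in the snippet above (per-document TF-IDF vectors, then pairwise cosine similarity) can be sketched roughly as follows. This is only an illustration in plain Python and does not use Lucene; the function names, the whitespace tokenization, the sample documents, and the smoothed IDF formula are all assumptions made for the sketch, not the poster's actual code.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Assumed IDF variant: 1 + ln(n / df), which stays positive
        # even for a term that occurs in every document.
        vectors.append({t: f * (1.0 + math.log(n / df[t])) for t, f in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical three-document collection; the first two are identical.
docs = [
    "lucene scores documents".split(),
    "lucene scores documents".split(),
    "cosine similarity of vectors".split(),
]
vecs = tf_idf_vectors(docs)
sim_ab = cosine(vecs[0], vecs[1])  # identical documents
sim_ac = cosine(vecs[0], vecs[2])  # no terms in common
```

With this weighting, two identical documents get identical vectors, so their cosine similarity is 1 up to floating-point rounding, while documents with no shared terms score 0. A different IDF formula or normalization would change the weights of individual terms but not those two boundary cases.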

Re: Better Way of calculating Cosine Similarity between documents

2012-05-18 Thread Akos Tajti
vector for each document. > Then using TF-IDF vectors, I calculate pairwise cosine similarity between > documents using the equation here > http://en.wikipedia.org/wiki/Cosine_similarity. > > This is my problem > > Say I have two identical documents “A” and “B” in this collection (A
