On 15.01.23 at 16:36, Michael Sokolov wrote:
I would suggest building Lucene from source and adding your own
similarity function to VectorSimilarity. That is the proper extension
point for similarity functions. If you find there is some substantial
benefit, it wouldn't be a big lift to add something like that. However
I'm dubious
Hi Michael,
You could create a custom KNN vectors format that ignores the vector
similarity configured on the field and uses its own.
On Sat, Jan 14, 2023 at 21:33, Michael Wechner wrote:
Hi
IIUC Lucene currently supports
VectorSimilarityFunction.COSINE
VectorSimilarityFunction.DOT_PRODUCT
VectorSimilarityFunction.EUCLIDEAN
whereas some embedding models have been trained with other metrics.
Also see
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdi
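For anyone exploring the extension point mentioned above, the kind of function it would have to supply can be sketched in plain Java. This is an illustrative Manhattan (L1) metric, not part of Lucene's API; the 1 / (1 + distance) mapping mirrors how the EUCLIDEAN built-in turns a distance into a score (higher = more similar).

```java
// Illustrative only: a Manhattan (L1) similarity in the style of
// Lucene's VectorSimilarityFunction constants. Larger return values
// mean "more similar", so the distance is mapped through 1 / (1 + d).
final class ManhattanSimilarity {
    static float compare(float[] v1, float[] v2) {
        if (v1.length != v2.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float distance = 0f;
        for (int i = 0; i < v1.length; i++) {
            distance += Math.abs(v1[i] - v2[i]);
        }
        return 1f / (1f + distance);
    }
}
```

Plugging such a function into Lucene itself would still require building from source, as suggested earlier in the thread.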
Hi
On the Lucene FAQ there is no mention of tf-idf or BM25 and I would
like to add some notes. To be sure I don't write anything wrong I
would like to ask:
is the current default similarity implementation of Lucene
really BM25, as described at
https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/ ?
Thanks
Michael
…is 80% similar
because it's a 5 character term with a single edit (1/5). I don't have enough
information yet to say if this is expected in the application or not, but it
explains how we get the scores so there's something satisfying about at least
that bit.
As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that
computed similarity to
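The arithmetic described above can be written down directly. This sketch only reproduces the 1 - edits/length calculation quoted in the thread (a 5-character term with a single edit gives 0.8); the exact normalization inside FuzzyTermsEnum may differ across Lucene versions, and the names here are illustrative, not Lucene API.

```java
// Per-term fuzzy boost as discussed above: similarity decreases
// linearly with the number of edits relative to the term length.
final class FuzzyBoostSketch {
    static float boost(int editDistance, int termLength) {
        return 1f - ((float) editDistance / (float) termLength);
    }
}
```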
Oh good! Thanks for clarifying, Uwe
On Sat, Jul 9, 2022, 12:23 PM Uwe Schindler wrote:
Hi
FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect.
Actually it does this:
* By default FuzzyQuery uses […]; if you get a term "foo", you could maybe
search for "foo OR foo~"?
On Fri, Jul 8, 2022 at 4:14 PM Mike Drob wrote:
Hi folks,
I'm working with some fuzzy queries and trying my best to understand what
is the expected behaviour of the searcher. I'm not sure if this is a
similarity bug or an incorrect usage on my end.
The problem is when I do a fuzzy search for a term "spark~" then instead o
I think what I'm looking for is to multiply the term frequency of each term
by the similarity score.
E.g. for 'shoes', it's an exact match, so tf * 1
For 'socks', similarity = 0.8 -> tf * 0.8
For 'clothes', similarity = 0.65 -> tf * 0.65
Is there a way to achieve this?
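The multiplication asked about here is easy to state outside Lucene; wiring it into scoring is the hard part (payloads or a custom query, as discussed elsewhere in this thread). A plain-Java sketch of the intended arithmetic, with made-up method names:

```java
import java.util.Map;

// Intended scoring from the example above: each matching term contributes
// tf(term) * similarity(term), with similarity = 1.0 for exact matches.
final class SimilarityWeightedScore {
    static double score(Map<String, Integer> tf, Map<String, Double> sim) {
        double total = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            total += e.getValue() * sim.getOrDefault(e.getKey(), 0.0);
        }
        return total;
    }
}
```

For example, a document containing 'socks' twice with similarity 0.8 would score 2 * 0.8 = 1.6.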
Hellooo,
Suppose a user enters ‘box of shoes’ in my search box. I have two documents
titled ‘box of clothes’ and ‘box of socks’. I’ve figured out through a
separate algorithm that ‘socks’ is more similar to ‘shoes’ than clothes.
I even have a numeric score for the similarity: for socks it’s 0.8
Payloads are only scored from certain query types. What query are you
executing?
Hej there,
I want to extend the TFIDF Similarity class such that the term frequency is
neglected and the value in the payload used instead. Therefore I basically
do this:
@Override
public float tf(float freq) {
    return 1f;
}

public float scorePayload(int doc, int start
Hi Roy,
In order to activate payloads during scoring, you need to do two separate
things at the same time:
* use a payload aware query type: org.apache.lucene.queries.payloads.*
* use payload aware similarity
Here is an old post that might inspire you :
https://lucidworks.com/2009/08/05
Thanks for your replies. But I am still not sure how to do this.
Can you please provide an example code snippet, or a link to
a page where I can find one?
Thanks..
On Tue, Jan 16, 2018 at 3:28 PM, Dwaipayan Roy
wrote:
If you are working with payloads, you will also want to have a look at
PayloadScoreQuery.
On Tue, Jan 16, 2018 at 12:26, Michael Sokolov wrote:
Have a look at Expressions class. It compiles JavaScript that can reference
other values and can be used for ranking.
On Jan 16, 2018 4:58 AM, "Dwaipayan Roy" wrote:
I want to make a scoring function that will score the documents by the
following function:
given Q = {q1, q2, ...}
score(D, Q) = SUM over all qi of LOG{ weight_1(qi) + weight_2(qi) + weight_3(qi) }
I have stored weight_1, weight_2 and weight_3 for all terms of all docu
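In plain Java the function quoted above is just a log of a sum, accumulated per query term; inside Lucene it would live in a custom Similarity or an Expressions formula, as the replies suggest. The class and method names here are illustrative only:

```java
// score(D, Q) = SUM over query terms qi of LOG(w1(qi) + w2(qi) + w3(qi)).
// Each row of 'weights' holds {weight_1, weight_2, weight_3} for one qi.
final class LogSumScore {
    static double score(double[][] weights) {
        double score = 0.0;
        for (double[] w : weights) {
            score += Math.log(w[0] + w[1] + w[2]);
        }
        return score;
    }
}
```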
Maybe have a look at SynonymQuery:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/SynonymQuery.java
I think it does a similar thing to what you want, it sums up the
frequencies of the synonyms and passes that sum to the similarity
class as TF.
given to the similarity class by score(int doc, float freq). Which class
does provide that freq? Or what can I change to provide a different freq
value, practically changing the document representation (e.g., freq[0] =
freq[0] + freq[1]; freq[1] = 0);
Regards,
Mossaab
Hello.
I want to set LMJelinekMercer similarity (with lambda set to, say, 0.6) for
the Luke similarity calculation. Luke by default uses DefaultSimilarity.
Can anyone help with this? I use Lucene 4.10.4 and Luke for that version
of the Lucene index.
Dwaipayan..
Hi Siraj,
I think
https://lucene.apache.org/core/6_1_0/core/index.html?org/apache/lucene/search/ConstantScoreQuery.html
should be good enough.
On Fri, Jul 8, 2016 at 12:27 AM Siraj Haider wrote:
We are in the process of upgrading from 2.x to 6.x. In 2.x we implemented our
own similarity where all the functions return 1.0f. How can we implement such a
thing in 6.x? Is there an existing implementation that we can use to get
the same results?
--
Regards
-Siraj Haider
(212) 306
Hi Luís,
That's an interesting question. Can you share your similarity?
I suspect you return 1 everywhere except in the Similarity#coord method.
Not sure, but for phrase queries one may need to modify
ExactPhraseScorer etc.
ahmet
On Thursday, May 12, 2016 5:41 AM, Luís Filipe Nassif wrote:
Hi,
In the past (Lucene 4) I tried to implement a simple Similarity that
only counts the number of occurrences (term frequencies) in the documents,
ignoring norms, doc frequencies, boosts... It worked for some queries, like
term and wildcard queries, but not for others, like phrase and range
Hi,
I only use BooleanQuery with TermQuery clauses.
I found the following methods that seem relevant to my needs.
There is a variable named maxOverlap, which is the total number of terms in the
query.
BooleanScorer's constructor has maxCoord variable
Similarity#coord
BooleanWeight#coord
Hi,
How can I access length of the query (number of words in the query) inside a
SimilarityBase implementation?
P.S. I am implementing multi-aspect TF [1] for an experimental study.
So it does not have to be fast/optimized as production code.
[1] http://dl.acm.org/citation.cfm?doid=2484028.2484
Let's say I have a boolean query "a AND b". Is it possible to run the search
for this boolean query with similarity "Sa" set for query "a", and
similarity "Sb" set for query "b" ?
Hi All,
Greetings,
Just started with Lucene 5.1 a month ago for my research. I have a set
of documents indexed with term frequencies option enabled during indexing.
For any two given documents, I would like to calculate their tf-idf cosine
similarity. Could you please point me to the right
Hi,
I have a number of similarity implementations that extend SimilarityBase.
I need to learn which term I am scoring inside the method :
abstract float score(BasicStats stats, float freq, float docLen);
What is the easiest way to access the query term that I am scoring in
similarity class
Hello everyone,
My task at hand is to compute the pairwise cosine similarity between a list
of documents.
I first index all the documents with DOCS_AND_FREQS option, then I
construct a query from every term of a document:
Query query = parser.parse(document);
making sure to use the same
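For reference, the cosine computation itself is independent of Lucene; the term-frequency maps below stand in for vectors that would in practice be built from the index (e.g. via IndexReader.getTermVector). A self-contained sketch:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Cosine similarity of two documents represented as term -> frequency maps.
final class CosineSimilaritySketch {
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (String t : terms) {
            int fa = a.getOrDefault(t, 0);
            int fb = b.getOrDefault(t, 0);
            dot += (double) fa * fb;
            normA += (double) fa * fa;
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Identical documents score 1, documents sharing no terms score 0.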
hi,
I am a second-year undergraduate at the University of Moratuwa, Sri Lanka. For my
second-year project I am building a question answering system (knowledge base). In
this project I have to suggest similar questions previously asked by other users.
I need to find the similarity of two sentences in my application to
Update: I have implemented my own subclasses of QueryParser, BooleanQuery,
BooleanScorer and Similarity to deal with this.
I have been successful in getting the exact behaviour I want... when
calling the .explain() method. However, the scores for some documents often
differ when calling
Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that
explanation more prominent, as I clearly missed it.
Never mind, I am working on my own solution for this, through subclassing
QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other
classes.
On 1/15/15 11:23 AM, danield wrote:
Hi Mike,
Thank you for your reply. Yes, I had thought of this, but it is not a
solution to my problem, and this is because the Term Frequency and therefore
the results will still be wrong, as prepending or appending a string to the
term will still make it a di
different classes differently will lead to increased relevance
of results.
This also doesn't change the fact that the documentation is wrong! Any ideas how
to fix?
Daniel
query1="field:field1\:term1"
query2="field:(field1\:term1 or field2\:term1)"
-Mike
On 1/13/15 2:24 PM, danield wrote:
Corrections:
document2={field1:"term1", field2:"term1"}
Coord(query1,document2)= 1/1 = 1
(Doesn't affect the problem/observation)
Hi all,
I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is very dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf
Term Frequency (TF) counts are
Payload feature to add meta information to tokens. I specifically add
weights (i.e. 0-100) to conceptual tags in order to use them to
overwrite the standard Lucene TF-IDF weighting. I am puzzled by the
behaviour of this and I believe there is something wrong with the
Similarity class that I overrode, but I cannot figure it out.
I attach the complete code below for this example. When I run a query
with it (e.g. "concept:red") I discover that each payload is always the
first number that was passed through MyPayloadSimilarity (in the c
Hi,
I'm trying to implement Explicit Semantic Analysis (ESA) via Lucene.
How do I take a term TFIDF in a query into consideration when matching
documents?
For example:
Query:"a b c a d a"
Doc1:"a b a"
Doc2:"a b c"
The query should match Doc1 better than Doc2.
I'd like this to work without impacting
Hello,
what is the best method to score documents similar to the default similarity, but
with the document frequency calculated per query against the matching result
document set, not statically against the whole corpus?
I didn't find a goo
To answer my own question, it appears that despite the warning, using a
custom similarity only at search time appears to be working. The score()
method was the wrong code to override, I simply hardcoded the return value
of decodeNormValue to 1.0. Since this value is used for normalization, as
long
I am currently using document-level boosts, which really translates to
changing the norm for every field under the covers. As part of an
experiment, I want to remove the boost, but that would require either
re-indexing content or changing the scoring algorithm (similarity).
If I create my own
When I include a BooleanClause with a NumericRangeQuery, the results for
MUST are different from those for SHOULD (as expected). My question is:
In the case of SHOULD, is the NumericRangeQuery effectively ignored? Is
there a similarity calculation based on how far the document's field val
Sent: Wednesday, September 04, 2013 1:45 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Text Similarity
Thanks to all, I will take into account your suggestions.
But I think that should have given the concrete use case. Therefore,
taking into account my first example given, I have the
Is there any way to check the similarity of texts with Lucene?
I have DBpedia indexed and want to find the indexed texts most similar
to another text, comparing against the abstract field. If I do a search in the
abstract field with a particular text, the result is not very
satisfactory. E.g.:
Abstract
ok that makes sense.
Shai
On Mon, Aug 12, 2013 at 9:18 PM, Robert Muir wrote:
On Mon, Aug 12, 2013 at 11:06 AM, Shai Erera wrote:
>
> Or, you'd like to keep FieldCache API for sort of back-compat with existing
> features, and let the app control the "caching" by using an explicit
> RamDVFormat?
>
Yes. In the future ideally fieldcache goes away and is a
UninvertingFilterRea
Yes, exactly. It's a little confusing, but a tradeoff to make docvalues
work transparently with lots of existing code built off of FieldCache
(sorting/grouping/joins/faceting/...) without having to have 2
separate implementations of what is the same thing. So it's like
"docvalues is a fieldcache you already built at index-time".
> Also, my similarity was extending SimilarityBase, and I can't see how to
> get a docId as it is not passed in the score method "score(BasicStats
> stats, float freq, float docLen)". Will I need to extend using Simila
Okay, just for clarity's sake, what you are saying is that if I make the
FieldCache call it won't actually create and impose the loading time of the
FieldCache, but rather just use the NumericDocValuesField instead. Is this
correct?
Also, my similarity was extending SimilarityBase, and I can't
The JavaDocs for NumericDocValuesField indicates that this field value can
be used for scoring. The example shows how to store the field, but I am
unclear as to how to retrieve the value of the field while in a similarity
to use it when scoring a document? Can someone point me to an example or
It's not hard to implement one. Store the term values of your document with
payloads. Then create your own Query and override the score function with
your cosine similarity logic.
The problem is that you need to watch out for performance, especially for
terms that have a very high DF. It may dec
Hi,
I would like to calculate the raw cosine similarity between a query and a
document. I read the documentation about Lucene scoring but I'm still
confused. Does any implementation exist in Lucene 4.3.0 to do that? If
not, what is the easiest way to do this?
So far I'm retrieving a TermVector fo
Dear Users,
I'm calculating cosine similarity between two documents using code based
on the code at this link:
http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html
It is working fine, but I want to use terms from two different fields in
my indexed docu
Have you already checked Solr's MoreLikeThis?
http://wiki.apache.org/solr/MoreLikeThisHandler and
http://wiki.apache.org/solr/MoreLikeThis
You describe a problem similar to the use case of that component; if there is
something to hack, it is Solr's MoreLikeThis.
Lucene's simi
Is there an api in Lucene for finding the similarity score for two
documents that have been randomly pulled from an index? What about for a
query and a randomly selected document?
I realize this isn't the standard purpose of Lucene, but I was given a task
to compare similarity scores fo
the performance (speed) ?
In either case... patches welcome!
On Mon, Feb 4, 2013 at 6:01 PM, Michael O'Leary wrote:
I'd like to compare the relevance scores that are returned when using the
Similarity classes that are available in Lucene 4.x, and it seems like
using the Benchmark component would be a good way to do that. It looks like
there isn't currently a way to specify a Similarity class to use in
On Fri, Nov 16, 2012 at 5:18 PM, Tom Burton-West wrote:
> Hi Otis,
>
> I hope this is not off-topic,
>
> Apparently in Lucene similarity does not have to be set at index time:
>
Actually in the general case it does. IndexWriter calls the Similarity's
computeNorm method
Hi Otis,
I hope this is not off-topic,
Apparently in Lucene similarity does not have to be set at index time:
See http://lucene.apache.org/core/4_0_0/changes/Changes.html under Lucene
2959
"All models default to the same index-time norm encoding as
DefaultSimilarity, so you can easily try
your index to the new
format with IndexUpgrader first."
So basically in my case I do not need to set it in the .alg file.
On Wed, Sep 5, 2012 at 7:58 AM, Sachin Kulkarni wrote:
Hi,
For Lucene core 4.0 BETA, under the search.similarities help page it says
the following
"To change
Similarity<http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html>,
one must do so for both indexing and searching, and the changes
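As a configuration sketch of what the quoted javadoc requires (assuming Lucene 4+ style APIs; constructor details vary by version, and analyzer, dir and reader are placeholders):

```java
// The same Similarity instance is applied at index time (it controls
// how norms are encoded) and at search time (it controls scoring).
Similarity sim = new BM25Similarity();

IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setSimilarity(sim);                    // index time
IndexWriter writer = new IndexWriter(dir, iwc);

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(sim);               // search time
```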
thank you so much for the prompt reply
I need to extract a document from the index that is similar to an HTML
document, and I need to use cosine similarity or latent semantic analysis, which
means that I need to generate a term vector for the HTML document; the link you
sent me doesn't co
can use to map the document to one of the documents in
> the index
> regardsshaimaa
Resending again, since my question didn't get much attention
-- Forwarded message --
From: Kasun Perera
Date: Tue, Jun 19, 2012 at 3:26 PM
Subject: Different Weights to Lucene fields with Okapi Similarity
To: java-user@lucene.apache.org
Based on this link http://www2002.org/CDROM/refereed/643/node6.html , I'm
calculating Okapi similarity between the query document and another
document as below using Lucene:
I have indexed the documents using 3 fields. I want to give higher weight
to field 2 and field 3. I can't us
Hi All,
Sorry... I gave the wrong example; it should actually be like this:
On Mon, May 21, 2012 at 9:31 PM, Robby wrote:
> - Grouping 1, count : 3
> - row id = 1
> - row id = 23
> - row id = 100
> - Grouping 2
> - row id = 11
> - row id = 133
> - ...
>
Regards,
Ro
tering based on similarity between four of five fields.
The end result would be something like this :
- Grouping 1, count : 3
- row id = 1
- row id = 23
- row id = 100
- Grouping 2
- row id = 1
- row id = 23
- ...
I have done some research and MoreLikeThis class can
and their term frequencies by
reading the index and calculate TF-IDF score vectors for each document.
Then using the TF-IDF vectors, I calculate pairwise cosine similarity between
documents using the equation here:
http://en.wikipedia.org/wiki/Cosine_similarity
This is my problem:
Say I have two identical documents "A" and "B" in this collection (A