Do you want to search for shingles?
On 3/4/2015 9:16 PM, Stephen Rudd wrote:
I have created a slightly hairy document collection that contains 10s of
millions of DNA sequence words that I wish to process to find rarer and unique
words. Each of the words is between 100 characters (nucleotides)
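For readers unfamiliar with the term, a shingle here means an overlapping substring of fixed length. The following is a minimal sketch, not the approach either poster used, of counting shingles across a set of sequences and keeping the rare ones. The function names, the shingle length `k`, and the `max_count` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def shingles(sequence, k):
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def rare_shingles(sequences, k, max_count=1):
    """Return the shingles that occur at most max_count times overall.

    max_count=1 keeps only shingles that are unique in the collection.
    (Hypothetical helper: the threshold is an illustrative assumption.)
    """
    counts = Counter()
    for seq in sequences:
        counts.update(shingles(seq, k))
    return {s for s, c in counts.items() if c <= max_count}


if __name__ == "__main__":
    # Toy data, far shorter than real nucleotide sequences.
    print(rare_shingles(["ACGT", "ACGA"], k=3))
```

An in-memory counter like this would not scale to tens of millions of long sequences; it only shows the idea.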
Actually, Google uses OR. The scoring algorithm favors documents that
match on more of the ORed terms.
On 4/16/2014 8:17 AM, Min-Uk Kim wrote:
Hello everyone,
I recently wondered,
why Lucene's default conjunction operator is "OR".
Is there a historical reason for that?
By the way,
Google an
Thanks.
These are familiar. Any other approaches that people use? I guess I'm
hoping ...
On 4/6/2014 7:37 AM, Benson Margulies wrote:
On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat wrote:
Just curious, what are some of the things that people do to properly
tokenize the queries with
Just curious, what are some of the things that people do to properly
tokenize the queries with mixed language collections? What do you do
with mixed language queries?
On 4/6/2014 4:51 AM, Benson Margulies wrote:
You must know what language each text is in, and use an appropriate
analyzer. Som
The default query parser for CJK languages breaks text into bigrams. A
word consisting of characters ABCDE is broken into tokens AB, BC, CD,
DE, or
"轻歌曼舞庆元旦"
into
data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
Each pair may or may not be a word, but if you use the same parser (i.e.
analyz
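The bigram splitting described above can be sketched in a few lines. This is an illustration of the idea only, not Lucene's analyzer code. The function name `cjk_bigrams` is hypothetical, and returning a lone character unchanged is an assumption about how single-character input should be handled.

```python
def cjk_bigrams(text):
    """Split a run of characters into overlapping two-character tokens."""
    if len(text) < 2:
        # Assumption: a single character is kept as its own token.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    print(cjk_bigrams("ABCDE"))
    print(cjk_bigrams("轻歌曼舞庆元旦"))
```

Because the same function is applied to both indexed text and query text, a query matches wherever its bigrams line up with the document's, whether or not each pair is a real word.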
Computing the cosine between two documents requires that the vectors for
each document be the same length (same number of elements, same
dimensionality, not the norm). The length of the vector is the length
of the vocabulary for the whole set. The two sets will inevitably have
different nu
If you want to compute the cosines between pairs of documents (each a compared
with each b), then the dimension is 100, the size of each document. If you want
to compare the whole index then you will need to make them the same length
(number of elements) by padding the shorter with zeroes. There
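A minimal sketch of the zero-padding idea, using plain Python lists as term-weight vectors. The names `pad` and `cosine` are hypothetical. The sketch assumes that position i refers to the same vocabulary term in both vectors, since padding does not help if the elements are not aligned. Returning 0.0 for an all-zero vector is also an assumption made here to avoid division by zero.

```python
import math


def pad(vec, length):
    """Extend vec with zeroes until it has the requested length."""
    return list(vec) + [0.0] * (length - len(vec))


def cosine(a, b):
    """Cosine similarity of two vectors, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a, b = pad(a, n), pad(b, n)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat similarity with an empty/all-zero vector as 0.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine([1, 2, 0], [1, 2]))
```

Padding leaves the norm of the shorter vector unchanged, so it alters only the dimensionality, not the weights.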
I got that one figured out. Thanks.
On 12/31/2011 5:51 PM, Herb Roitblat wrote:
Can someone point me to information on how to debug a filter? How do
I access the bit-string? Our problem seems to be that when we set a
filter, not all of the appropriate bits are set and when we use the
filter
Can someone point me to information on how to debug a filter? How do I
access the bit-string? Our problem seems to be that when we set a
filter, not all of the appropriate bits are set and when we use the
filter to retrieve the documents, not all of the documents that we
intended to set are r
hem.
--
Ian.
On Sat, Oct 15, 2011 at 7:47 PM, Herb Roitblat wrote:
I have an application where I would like to pick one document from somewhere
in the list of search results. For example, I would like to retrieve one of
the results at rank 57, another at rank 1223, etc. I'm not real clear on
how
I have an application where I would like to pick one document from
somewhere in the list of search results. For example, I would like to
retrieve one of the results at rank 57, another at rank 1223, etc. I'm
not real clear on how to do it.
I have seen some things on simulating pagination wi
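One way to picture the request, independent of any search library: once the hits are available as a list ordered by rank, picking specific ranks is an index lookup. The sketch below assumes the ranked list is already in memory. `pick_at_ranks` and the `doc…` identifiers are hypothetical, and ranks are treated as 1-based.

```python
def pick_at_ranks(ranked_hits, ranks):
    """Return {rank: hit} for each requested 1-based rank that exists."""
    picked = {}
    for rank in ranks:
        if 1 <= rank <= len(ranked_hits):
            picked[rank] = ranked_hits[rank - 1]
    return picked


if __name__ == "__main__":
    # Stand-in for search results already sorted best-first.
    hits = [f"doc{i}" for i in range(1, 2001)]
    print(pick_at_ranks(hits, [57, 1223]))
```

Ranks beyond the end of the list are skipped rather than raising an error, which is a design choice of this sketch.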