Re: substring query

2015-03-04 Thread Herb Roitblat
Do you want to search for shingles? On 3/4/2015 9:16 PM, Stephen Rudd wrote: I have created a slightly hairy document collection that contains 10s of millions of DNA sequence words that I wish to process to find rarer and unique words. Each of the words is between 100 characters (nucleotides)

Re: is there a historical reason why default conjunction operator is "OR"?

2014-04-16 Thread Herb Roitblat
Actually, Google uses OR. The scoring algorithm favors documents that match on more of the ORed terms. On 4/16/2014 8:17 AM, Min-Uk Kim wrote: Hello everyone, I recently wondered, why lucene's default conjunction operator is "OR". Is there a historical reason for that? By the way, Google an
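
To illustrate the point about OR queries favoring documents that match more of the terms, here is a minimal sketch. It is not Lucene's or Google's scoring: the function `or_match_scores` and the sample documents are hypothetical, and the score is simply the count of distinct query terms found in each document.

```python
def or_match_scores(query_terms, docs):
    """Rank documents for an OR query by how many distinct query terms they contain.

    Hypothetical helper for illustration only; real engines also weight
    terms by frequency and rarity.
    """
    query = {term.lower() for term in query_terms}
    scores = {}
    for doc_id, text in docs.items():
        matched = query & set(text.lower().split())
        if matched:  # OR semantics: one matching term is enough to be returned
            scores[doc_id] = len(matched)
    # Highest match count first; ties broken by document id
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sample_docs = {
    "d1": "lucene search engine",
    "d2": "lucene",
    "d3": "cooking",
}
print(or_match_scores(["lucene", "search"], sample_docs))
```

A document matching both terms ranks above one matching a single term, and a document matching none is not returned at all.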

Re: Confuse with Kuromoji

2014-04-06 Thread Herb Roitblat
Thanks. These are familiar. Any other approaches that people use? I guess I'm hoping ... On 4/6/2014 7:37 AM, Benson Margulies wrote: On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat wrote: Just curious, what are some of the things that people do to properly tokenize the queries with

Re: Confuse with Kuromoji

2014-04-06 Thread Herb Roitblat
Just curious, what are some of the things that people do to properly tokenize the queries with mixed language collections? What do you do with mixed language queries? On 4/6/2014 4:51 AM, Benson Margulies wrote: You must know what language each text is in, and use an appropriate analyzer. Som

Re: QueryParser

2014-03-24 Thread Herb Roitblat
The default query parser for CJK languages breaks text into bigrams. A word consisting of characters ABCDE is broken into tokens AB, BC, CD, DE, or "轻歌曼舞庆元旦" into data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦 Each pair may or may not be a word, but if you use the same parser (i.e. analyz
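
The bigram splitting described above can be sketched in a few lines. This is only an illustration of overlapping two-character tokens, not Lucene's analyzer code; the function name `cjk_bigrams` is hypothetical.

```python
def cjk_bigrams(text):
    """Return the overlapping two-character tokens of a string."""
    if len(text) < 2:
        # A single character (or empty string) cannot form a pair
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("ABCDE"))
print(cjk_bigrams("轻歌曼舞庆元旦"))
```

Because the same splitting is applied to both the indexed text and the query, a query for a real word still lines up with the pairs stored in the index.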

Re: Dimension mismatch exception

2014-03-21 Thread Herb Roitblat
Computing the cosine between two documents requires the vectors for each document to be the same length (same number of elements, same dimensionality, not the norm). The length of the vector is the length of the vocabulary for the whole set. The two sets will inevitably have different nu
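
A minimal sketch of the same-dimensionality requirement, assuming each document is already a list of tokens: both vectors are built over one shared vocabulary, so a term missing from a document contributes a zero. The function name `cosine` and the raw term-count weighting are assumptions made for illustration, not the method used in the thread.

```python
import math
from collections import Counter


def cosine(doc_a, doc_b):
    """Cosine similarity of two token lists over their shared vocabulary."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    # One vocabulary for both documents gives both vectors the same length
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[term] for term in vocab]  # absent terms count as 0
    vec_b = [counts_b[term] for term in vocab]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine(["a", "b"], ["a", "c"]))
```

Building the vectors from a shared vocabulary is what avoids a dimension mismatch: the vectors differ in content but never in length.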

Re: Dimension mismatch exception

2014-03-20 Thread Herb Roitblat
If you want to compute the cosines between pairs of documents (each a compared with each b), then the dimension is 100, the size of each document. If you want to compare the whole index then you will need to make them the same length (number of elements) by padding the shorter with zeroes. There

Re: debug filters

2012-01-02 Thread Herb Roitblat
I got that one figured out. Thanks. On 12/31/2011 5:51 PM, Herb Roitblat wrote: Can someone point me to information on how to debug a filter? How do I access the bit-string? Our problem seems to be that when we set a filter, not all of the appropriate bits are set and when we use the filter

debug filters

2011-12-31 Thread Herb Roitblat
Can someone point me to information on how to debug a filter? How do I access the bit-string? Our problem seems to be that when we set a filter, not all of the appropriate bits are set and when we use the filter to retrieve the documents, not all of the documents that we intended to set are r

Re: Picking single results out of a list of results

2011-10-19 Thread Herb Roitblat
hem. -- Ian. On Sat, Oct 15, 2011 at 7:47 PM, Herb Roitblat wrote: I have an application where I would like to pick one document from somewhere in the list of search results. For example, I would like to retrieve one of the results at rank 57, another at rank 1223, etc. I'm not real clear on how

Picking single results out of a list of results

2011-10-15 Thread Herb Roitblat
I have an application where I would like to pick one document from somewhere in the list of search results. For example, I would like to retrieve one of the results at rank 57, another at rank 1223, etc. I'm not real clear on how to do it. I have seen some things on simulating pagination wi