Spellchecker could/should use n-gram tokenizers instead of rolling its own
n-gramming
-
Key: LUCENE-760
URL: http://issues.apache.org/jira/browse/LUCENE-760
Project:
[ http://issues.apache.org/jira/browse/LUCENE-759?page=all ]
Otis Gospodnetic resolved LUCENE-759.
-
Resolution: Fixed
Unit tests pass, committed.
> Add n-gram tokenizers to contrib/analyzers
> --
>
>
[ http://issues.apache.org/jira/browse/LUCENE-759?page=all ]
Otis Gospodnetic updated LUCENE-759:
Attachment: LUCENE-759.patch
Included:
NGramTokenizer
NGramTokenizerTest
EdgeNGramTokenizer
EdgeNGramTokenizerTest
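
For readers unfamiliar with the two tokenizer names above, here is a minimal, standalone sketch of what plain and edge n-gramming produce. It does not use the Lucene API or the code in LUCENE-759.patch. The class name `NGramSketch`, the method names, and the ordering of the output (shortest grams first) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT the NGramTokenizer/EdgeNGramTokenizer from the patch.
public class NGramSketch {

    /** All character n-grams of length minGram..maxGram, shortest grams first. */
    public static List<String> ngrams(String input, int minGram, int maxGram) {
        checkSizes(minGram, maxGram);
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of width n across the input.
            for (int start = 0; start + n <= input.length(); start++) {
                grams.add(input.substring(start, start + n));
            }
        }
        return grams;
    }

    /** Only the n-grams anchored at the front edge of the input (its prefixes). */
    public static List<String> edgeNgrams(String input, int minGram, int maxGram) {
        checkSizes(minGram, maxGram);
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram && n <= input.length(); n++) {
            grams.add(input.substring(0, n));
        }
        return grams;
    }

    private static void checkSizes(int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
    }

    public static void main(String[] args) {
        System.out.println(ngrams("abcd", 1, 2));     // [a, b, c, d, ab, bc, cd]
        System.out.println(edgeNgrams("abcd", 1, 3)); // [a, ab, abc]
    }
}
```

A spellchecker would typically index grams like these per dictionary word and then look up candidate words that share grams with a misspelled input.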
> Add n-gram tokenizers to
Nicolas Lalevée wrote:
I have just looked at it. It looks great :)
Thanks! :-)
But I still don't understand why a new entry in the fieldinfo is needed.
The entry is not really *needed*, but I use it for
backwards-compatibility and as an optimization for fields that don't
have any
Add n-gram tokenizers to contrib/analyzers
--
Key: LUCENE-759
URL: http://issues.apache.org/jira/browse/LUCENE-759
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
[
http://issues.apache.org/jira/browse/LUCENE-708?page=comments#action_12460595 ]
Grant Ingersoll commented on LUCENE-708:
Nightly build distribution of binary jars should not contain instrumented
classes from Clover.
> Setup nightly bu
On Dec 22, 2006, at 10:36 AM, Doug Cutting wrote:
The easiest way to do this would be to have separate files in each
segment for each PostingFormat. It would be better if different
posting formats could share files, but that's harder to coordinate.
The approach I'm taking in KinoSearch 0.
On 12/22/06, Doug Cutting wrote:
Ning Li wrote:
> The draft proposal seems to suggest the following (roughly):
> A dictionary entry is .
Perhaps this ought to be , where TermInfo contains a
FilePointer and perhaps other information (e.g., frequency data).
Yes. Another exam
Ning Li wrote:
I'm aware of this design. Boolean and phrase queries are an example.
The point is, there are different queries whose processing will
(continue to) require different information of terms, especially when
flexible posting is allowed. The question is, should the number of
files used t
Ning Li wrote:
The draft proposal seems to suggest the following (roughly):
A dictionary entry is .
Perhaps this ought to be , where TermInfo contains a
FilePointer and perhaps other information (e.g., frequency data).
A posting entry for a term in a document is .
Classes which implement
On Dec 22, 2006, at 9:17 AM, Ning Li wrote:
The question is, should the number of
files used to store postings be customizable?
I think it ought to remain an implementation detail for now. Using
multiple files is an optimization of unknown advantage.
Optimizations have to work very hard
[
http://issues.apache.org/jira/browse/LUCENE-758?page=comments#action_12460555 ]
Michael McCandless commented on LUCENE-758:
---
Thank you for the full test case showing the issue!
However, I believe this is by design. When you init a R
[ http://issues.apache.org/jira/browse/LUCENE-662?page=all ]
Nicolas Lalevée updated LUCENE-662:
---
Attachment: generic-fieldIO-4.patch
Patch synchronized with the trunk.
I also tried to minimize the diff. And in fact I just realized that there are
two
On 12/22/06, Marvin Humphrey wrote:
Precision would be enhanced if boolean scoring took position into
account, and could be further enhanced if each position were assigned
a boost. For that purpose, having everything in one file is an
advantage, as it cuts down disk seeks. T
On Dec 21, 2006, at 1:58 PM, Ning Li wrote:
Storing all the posting content, e.g. frequencies and positions, in a
single file greatly simplifies things. However, this could cause some
performance penalty. For example, boolean query 'Apache AND Lucene'
would have to paw through positions. But po
On Wednesday 20 December 2006 at 20:42, Michael Busch wrote:
> Doug Cutting wrote:
> > Michael,
> >
> > This sounds like very good work. The back-compatibility of this
> > approach is great. But we should also consider this in the broader
> > context of index-format flexibility.
> >
> > Three gene
[
http://issues.apache.org/jira/browse/LUCENE-755?page=comments#action_12460496 ]
Grant Ingersoll commented on LUCENE-755:
Great patch, Michael, and something that will come in handy for a lot of
people. I can vouch it applies cleanly a
IndexReader.isCurrent fails when using two IndexReaders
---
Key: LUCENE-758
URL: http://issues.apache.org/jira/browse/LUCENE-758
Project: Lucene - Java
Issue Type: Bug
Affects Versions:
[ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]
Doron Cohen updated LUCENE-756:
---
Attachment: nrm.patch.2.txt
nrm.patch.2.txt:
Updated as Doug suggested:
- ".nrm" extension is now maintained in a constant.
- .nrm file now has a 4-byte header.
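
The two points above (an extension constant and a fixed 4-byte header) can be sketched as follows. This is not the code in nrm.patch.2.txt: the class name `NormsHeaderSketch`, the constant names, and the header byte values are hypothetical, and the sketch works on in-memory byte arrays rather than Lucene's own I/O classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

// Illustrative only: names and header bytes are assumptions, not taken from the patch.
public class NormsHeaderSketch {

    /** File extension kept in one constant instead of scattered string literals. */
    public static final String NORMS_EXTENSION = "nrm";

    /** Hypothetical 4-byte header: three magic bytes plus a format marker. */
    static final byte[] HEADER = {'N', 'R', 'M', (byte) -1};

    /** Prefix the raw norm bytes with the fixed header. */
    public static byte[] write(byte[] norms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER, 0, HEADER.length);
        out.write(norms, 0, norms.length);
        return out.toByteArray();
    }

    /** Validate the header, then return the norm bytes that follow it. */
    public static byte[] read(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        byte[] header = new byte[HEADER.length];
        in.readFully(header); // throws EOFException if fewer than 4 bytes
        if (!Arrays.equals(header, HEADER)) {
            throw new IOException("unexpected ." + NORMS_EXTENSION + " header");
        }
        byte[] norms = new byte[file.length - HEADER.length];
        in.readFully(norms);
        return norms;
    }
}
```

A header like this lets a reader reject a file that is not in the expected format before it interprets any of the payload bytes.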