Dawid Weiss created LUCENE-9740:
-----------------------------------
Summary: Avoid buffering and double-scan of flags in *.aff file
Key: LUCENE-9740
URL: https://issues.apache.org/jira/browse/LUCENE-9740
Project: Lucene - Core
Issue Type: Sub-task
Reporter: Dawid Weiss
Assignee: Dawid Weiss
I wrote a small utility test that scans all the *.aff files from
openoffice and woorm. No file has duplicate flags (SET or FLAG), and the
maximum leading offsets at which these flags appear are roughly:
{code}
Flag SET at maximum offset 10753
Flag FLAG at maximum offset 4559
{code}
I think we could just assume that affix files are read with, say, a 20kB
buffered reader, which provides a maximum leading window for scanning for
those flags. Dictionary parsing could then also fail if either of these
flags occurs more than once in the input file.
This would avoid having to read the file twice and perhaps simplify the API (no
need for a temporary spill).
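A minimal sketch of the idea, not the actual Dictionary code: mark the stream, read at most one leading window, pick up SET and FLAG, then reset so the file is parsed only once. The class name AffixHeaderScan, the method scanLeadingFlags and the constant LEADING_WINDOW are hypothetical, and the 20kB window is just the figure proposed above. Duplicates are only detected inside the window here; catching a repeat further down the file would be up to the main parse.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.HashMap;
import java.util.Map;

public class AffixHeaderScan {
  /** Assumed leading window; the largest offset observed above was 10753. */
  static final int LEADING_WINDOW = 20 * 1024;

  /**
   * Returns the values of SET and FLAG found in the leading window and
   * rewinds the stream so the caller can parse it from the start.
   */
  static Map<String, String> scanLeadingFlags(InputStream in)
      throws IOException, ParseException {
    if (!in.markSupported()) {
      throw new IllegalArgumentException("stream must support mark/reset");
    }
    in.mark(LEADING_WINDOW);
    byte[] window = in.readNBytes(LEADING_WINDOW);
    in.reset();

    // Directive names and their values are ASCII, so a single-byte decode is
    // enough to find them before the file's real encoding is known.
    String text = new String(window, StandardCharsets.ISO_8859_1);
    if (window.length == LEADING_WINDOW) {
      // The window may end mid-line; ignore the truncated tail.
      text = text.substring(0, Math.max(0, text.lastIndexOf('\n')));
    }
    if (text.startsWith("\u00EF\u00BB\u00BF")) {
      text = text.substring(3); // skip a UTF-8 byte order mark
    }

    Map<String, String> found = new HashMap<>();
    int lineNumber = 0;
    for (String line : text.split("\\R")) {
      lineNumber++;
      String[] parts = line.trim().split("\\s+", 2);
      String key = parts[0];
      if (key.equals("SET") || key.equals("FLAG")) {
        String value = parts.length > 1 ? parts[1].trim() : "";
        if (found.putIfAbsent(key, value) != null) {
          throw new ParseException("Duplicate " + key + " directive", lineNumber);
        }
      }
    }
    return found;
  }

  public static void main(String[] args) throws Exception {
    byte[] aff = "# sample\nSET UTF-8\nFLAG long\nSFX A Y 1\n"
        .getBytes(StandardCharsets.UTF_8);
    InputStream in =
        new BufferedInputStream(new ByteArrayInputStream(aff), LEADING_WINDOW);
    Map<String, String> flags = scanLeadingFlags(in);
    System.out.println("SET=" + flags.get("SET") + " FLAG=" + flags.get("FLAG"));
    // The stream is back at offset 0, ready for the single full parse.
    System.out.println("first byte after reset: " + (char) in.read());
  }
}
```

Since the stream is rewound with reset(), nothing needs to be spilled to a temporary file; the cost is that a SET or FLAG appearing beyond the window would go unnoticed by this scan.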
I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it
locally.