Dawid Weiss created LUCENE-9740:
-----------------------------------

             Summary: Avoid buffering and double-scan of flags in *.aff file
                 Key: LUCENE-9740
                 URL: https://issues.apache.org/jira/browse/LUCENE-9740
             Project: Lucene - Core
          Issue Type: Sub-task
            Reporter: Dawid Weiss
            Assignee: Dawid Weiss


I wrote a small utility test that scans all the *.aff files from 
OpenOffice and wooorm. No file declares SET or FLAG more than once, and the 
maximum leading offsets at which these flags first appear are roughly:
{code}
Flag SET at maximum offset 10753
Flag FLAG at maximum offset 4559
{code}
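
A scan of this kind could look roughly like the sketch below. It is not the actual utility test; the class and method names (AffixFlagOffsets, firstOffset, maxOffset) are made up for illustration, and it assumes offsets are counted in bytes from the start of the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene.
final class AffixFlagOffsets {

  /**
   * Returns the byte offset of the first line starting with the given directive
   * (for example "SET" or "FLAG"), or -1 if there is none. Throws if the
   * directive appears on more than one line.
   */
  static int firstOffset(byte[] content, String directive) {
    // ISO-8859-1 maps every byte to exactly one char, so char offsets equal byte offsets.
    String text = new String(content, StandardCharsets.ISO_8859_1);
    // (?m) lets ^ match at line starts; (?d) treats only '\n' as a line terminator.
    // A leading UTF-8 byte-order mark is not handled here.
    Pattern p = Pattern.compile("(?md)^" + Pattern.quote(directive) + "[ \\t]");
    Matcher m = p.matcher(text);
    int first = -1;
    while (m.find()) {
      if (first >= 0) {
        throw new IllegalStateException(directive + " occurs more than once");
      }
      first = m.start();
    }
    return first;
  }

  /** Walks a directory tree and returns the largest first-occurrence offset over all *.aff files. */
  static int maxOffset(Path root, String directive) throws IOException {
    List<Path> affFiles;
    try (Stream<Path> files = Files.walk(root)) {
      affFiles = files.filter(f -> f.toString().endsWith(".aff")).collect(Collectors.toList());
    }
    int max = -1;
    for (Path file : affFiles) {
      max = Math.max(max, firstOffset(Files.readAllBytes(file), directive));
    }
    return max;
  }
}
```

Calling maxOffset(root, "SET") and maxOffset(root, "FLAG") on a local checkout of the dictionaries would then give numbers comparable to the ones above.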

I think we could simply assume that affix files are read through, say, a 
20 kB buffered reader, which bounds the leading window that is scanned for 
those flags. Perhaps dictionary parsing should also fail if either of these 
flags occurs more than once in the input file?

This would avoid having to read the file twice and perhaps simplify the API (no 
need for a temporary spill).
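
As a minimal sketch of the idea (not Lucene's actual code), the leading window can be read once through a BufferedInputStream with mark/reset, so the same stream is then parsed from the start with no second pass over the file. The class name AffixHeaderSniffer, the constant MAX_PROLOGUE_BYTES and the choice of IOException for duplicates are assumptions made for illustration; InputStream.readNBytes(int) requires Java 11 or later.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class AffixHeaderSniffer {
  // Assumed window size; the largest offset seen in the scan above was 10753.
  static final int MAX_PROLOGUE_BYTES = 20 * 1024;

  /**
   * Collects the values of SET and FLAG found in the first MAX_PROLOGUE_BYTES
   * of the stream, then rewinds the stream to its starting position.
   * Duplicates are only detected inside that window, not in the whole file.
   */
  static Map<String, String> sniff(BufferedInputStream in) throws IOException {
    in.mark(MAX_PROLOGUE_BYTES);
    byte[] window = in.readNBytes(MAX_PROLOGUE_BYTES);
    in.reset(); // the caller can now parse the stream from the beginning

    // The real encoding is unknown until SET is read; ISO-8859-1 decodes any
    // byte sequence and is enough for the ASCII directive names.
    String text = new String(window, StandardCharsets.ISO_8859_1);
    if (text.startsWith("\u00EF\u00BB\u00BF")) {
      text = text.substring(3); // skip a UTF-8 byte-order mark
    }
    if (window.length == MAX_PROLOGUE_BYTES) {
      // The window may end in the middle of a line; ignore that partial line.
      int lastNewline = text.lastIndexOf('\n');
      text = lastNewline < 0 ? "" : text.substring(0, lastNewline);
    }

    Map<String, String> found = new HashMap<>();
    for (String line : text.split("\\r?\\n")) {
      String[] parts = line.trim().split("\\s+", 2);
      boolean wanted = parts[0].equals("SET") || parts[0].equals("FLAG");
      if (wanted && parts.length == 2) {
        if (found.putIfAbsent(parts[0], parts[1].trim()) != null) {
          throw new IOException(parts[0] + " occurs more than once");
        }
      }
    }
    return found;
  }
}
```

A file whose SET or FLAG line lies beyond the window would simply not have it reported, so a caller would need to decide whether to fall back to defaults or reject such a file.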

I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it 
locally.


