magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to 
support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-536641016
 
 
   Thank you for the review/feedback, @msokolov! In addition to offset accuracy (as you mention), some form of incremental transliteration with rollback is necessary for baseline support of consistent transliteration in a streaming fashion. The only alternatives I can think of are:
   1. load the entire input into memory and transliterate the whole block at once (which would lose you _all_ offset accuracy and could be problematic memory-wise for large inputs), or
   2. transliterate input in arbitrary-sized chunks (which would get you offset accuracy at the granularity of the individual chunks).
   
   The second of these options sounds OK for cases where you don't care about offsets, but a bigger problem is that, regardless of whether the input is nominally processed "incrementally", the tail end of each chunk would in some cases undergo partial transformation, yielding inconsistent results (a minimal illustration follows).
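   To make the chunk-boundary problem concrete, here's a minimal sketch using plain ICU4J (not the code in this PR), with standalone NFC standing in for the trailing normalization stage of a compound transform:

```java
import com.ibm.icu.text.Transliterator;

public class ChunkBoundaryDemo {
    public static void main(String[] args) {
        // NFC composition needs context that can straddle an arbitrary
        // chunk boundary; standalone NFC stands in here for the trailing
        // normalization stage of a compound transform.
        Transliterator nfc = Transliterator.getInstance("Any-NFC");
        String input = "e\u0301"; // 'e' + combining acute accent
        String whole = nfc.transliterate(input); // composes to "\u00e9"
        // Naive chunking: transliterate each "chunk" independently.
        String chunked = nfc.transliterate(input.substring(0, 1))
                       + nfc.transliterate(input.substring(1));
        System.out.println(whole.equals(chunked)); // false: tail was split
    }
}
```

   Here the whole-input pass composes the accent, while the chunked pass leaves the tail of the first chunk uncomposed -- exactly the kind of boundary-dependent inconsistency described above.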
   
   I was curious as well about the performance implications of rollback (as 
compared to the `ICUTransformTokenFilter`, which has the input pre-split and 
can thus afford single-pass transliteration of each chunk). I put together a 
_really_ quick and dirty comparison that just runs inputs with 
configurably-sized, whitespace-delimited tokens that consist entirely of one 
character. The results (at least initially, superficially) seem to suggest that 
the latency scales ~linearly with respect to the average length of character 
runs that can be completely transliterated (advancing the rollback window) -- 
or, put another way, the average size to which the rollback buffer content 
grows before being "committed". See initial performance evaluation code at:
   
   
[charFilterPerformanceTest.txt](https://github.com/apache/lucene-solr/files/3672215/charFilterPerformanceTest.txt)
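   For a rough sense of the shape of that comparison, the essence is something like the sketch below (plain ICU4J, feeding one char at a time the way a streaming CharFilter would; the class name and token-generation parameters are illustrative, not the actual attached code):

```java
import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;

public class QuickPerfSketch {
    public static void main(String[] args) {
        Transliterator t = Transliterator.getInstance("Cyrillic-Latin");
        for (int tokenLen : new int[] {1, 2, 4, 8, 16, 32}) {
            // Whitespace-delimited tokens consisting of a single repeated char.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 20_000; i++) {
                for (int j = 0; j < tokenLen; j++) {
                    sb.append('д');
                }
                sb.append(' ');
            }
            String input = sb.toString();

            long start = System.nanoTime();
            ReplaceableString buf = new ReplaceableString();
            Transliterator.Position pos = new Transliterator.Position();
            for (int i = 0; i < input.length(); i++) {
                // Feed one char at a time, as a streaming CharFilter would;
                // ICU commits only what it can transliterate unambiguously.
                t.transliterate(buf, pos, String.valueOf(input.charAt(i)));
            }
            t.finishTransliteration(buf, pos);
            System.out.printf("tokenLen=%d: %.1f ms%n",
                tokenLen, (System.nanoTime() - start) / 1e6);
        }
    }
}
```

   The per-token cost should track how much text the incremental API holds back before a delimiter lets it commit, which is the scaling behavior described above.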
   
   It's a tricky thing to test, and I'd like to see it evaluated against real-world input, but the performance impact of incremental transliteration would also depend heavily on the particular transliteration type. For example, top-level `RuleBasedTransliterator` instances should never require any rollback, and should be plenty fast; some `CompoundTransliterator` instances should require little or no rollback, while others (particularly ones that bundle a trailing NFC transformation, like "Cyrillic-Latin") might require rollback for most non-delimiter character runs.
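   For anyone poking at this, the constituent steps of a given transform are easy to inspect with plain ICU4J; a minimal sketch:

```java
import com.ibm.icu.text.Transliterator;

public class InspectCompound {
    public static void main(String[] args) {
        // A compound transliterator exposes its constituent steps via
        // getElements() (for a non-compound instance, this just returns
        // an array containing the instance itself). Per the above,
        // "Cyrillic-Latin" bundles a trailing NFC stage, which is what
        // forces rollback for most non-delimiter character runs.
        Transliterator t = Transliterator.getInstance("Cyrillic-Latin");
        for (Transliterator step : t.getElements()) {
            System.out.println(step.getID());
        }
    }
}
```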
   
   If we somehow know that offsets are not necessary for a given instance, some optimizations would be possible (like initially buffering some large-but-not-too-large number of characters for a given input, in the hope that we reach the end of the input and can simply do block transliteration), roughly as sketched below.
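   A rough sketch of that idea (the class, threshold, and null-fallback contract here are hypothetical, not part of this PR):

```java
import java.io.IOException;
import java.io.Reader;

import com.ibm.icu.text.Transliterator;

// Hypothetical helper: if the whole input fits in a bounded buffer,
// skip incremental rollback and transliterate in a single pass.
final class BlockFirstTransliterator {
    private static final int MAX_BLOCK = 64 * 1024; // assumed threshold

    static String tryBlockTransliterate(Reader in, Transliterator t)
            throws IOException {
        char[] buf = new char[MAX_BLOCK + 1]; // one extra char detects overflow
        int len = 0;
        int n;
        while (len < buf.length
                && (n = in.read(buf, len, buf.length - len)) != -1) {
            len += n;
        }
        if (len > MAX_BLOCK) {
            // Input exceeds the threshold; a real implementation would hand
            // the buffered chars off to the incremental/rollback path here.
            return null;
        }
        return t.transliterate(new String(buf, 0, len)); // no rollback needed
    }
}
```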
