Re: Too long token is not handled properly?

Steve Rowe Mon, 14 Nov 2016 07:15:28 -0800

Hi Alexey,

> On Nov 14, 2016, at 3:49 AM, Alexey Makeev <makeev...@mail.ru.INVALID> wrote:
> 
> But, please correct me if I'm wrong, this change of semantics (which has 
> implications from the user's point of view) was a workaround for a performance 
> problem? If there weren't a performance problem, it would be better to keep 
> the original semantics?


Yes, I think so too.

> E.g. suppose we're indexing the text <255 random letters>bob <255 random 
> letters>ed. With the current implementation we'll have the tokens bob and ed 
> in the index. But from the user's point of view that's unexpected: neither Bob 
> nor Ed was mentioned in the text.
> Higher maxTokenLength + LengthFilter could solve this, but I think it's a 
> workaround too. What value for maxTokenLength should I set? 1M? But what if 
> there is a 2M token in the text?

Yes, that is a problem.  I suspect though for people that have such data and 
are negatively impacted by split tokens (actually only shorter trailing final 
tokens from a split long sequence are problematic, since the leading tokens can 
be stripped by LengthFilter), a CharFilter that removes such character 
sequences before tokenization, likely regex-based, is probably the best way to 
go for now.
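
A minimal sketch of the regex idea, using only java.util.regex so it runs without Lucene on the classpath. The class name LongRunStripper and the character class [\p{L}\p{N}] are assumptions made for illustration; the real tokenizer's notion of a token character is richer than "letter or digit". In an analysis chain the same Pattern could presumably be handed to PatternReplaceCharFilter, which is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: removes character runs that
// would exceed maxTokenLength before the text reaches the tokenizer.
final class LongRunStripper {

    // Matches any run of letters/digits strictly longer than maxTokenLength.
    // Assumption: "letters and digits" approximates what the tokenizer
    // would treat as one token.
    static Pattern longRun(int maxTokenLength) {
        return Pattern.compile("[\\p{L}\\p{N}]{" + (maxTokenLength + 1) + ",}");
    }

    // Replaces each over-long run with a single space, so neighbouring
    // tokens are not glued together.
    static String strip(String text, int maxTokenLength) {
        return longRun(maxTokenLength).matcher(text).replaceAll(" ");
    }
}
```

Because the whole run is removed, including its tail, the "bob" and "ed" fragments from the example above never reach the tokenizer. Note that replacing a long run with one space changes character offsets unless the filter corrects them.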

> I agree it's a difficult task to make the JFlex code silently skip too-long 
> tokens. I've scheduled an attempt to fix it some months from now, with 
> the following approach. When we encounter a situation where the buffer is 
> full and there could still be a longer match, enter "skipping" mode. In 
> skipping mode the full buffer is emptied, the corresponding indexes (zzEndRead 
> and others) are corrected, and matching continues. When we hit the maximum 
> length match, skipping mode ends without returning a token, and after yet 
> another index correction we return to normal mode. This approach to JFlex 
> matching won't work in general, but I suppose it'll work for the tokenizer, 
> because I didn't see any backtracking in the code (zzCurrentPos never 
> backtracks non-processed characters).
> It would be great to hear your thoughts on this idea.

Patches welcome!  I’m not quite sure how you’ll be able to do this for 
arbitrary match points within arbitrary rules, but I think it’s worth exploring.
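
For illustration only, here is a toy version of the "skipping" mode described above. It is not JFlex-generated code and does not touch zzEndRead or zzCurrentPos: it assumes a single rule (a token is a run of letters or digits, tested one char at a time), which is exactly the simplification that sidesteps the arbitrary-match-point question. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Toy single-rule tokenizer with a fixed-size buffer and a skipping mode.
final class SkippingTokenizerSketch {

    static List<String> tokenize(Reader in, int maxTokenLength) throws IOException {
        List<String> tokens = new ArrayList<>();
        char[] buffer = new char[maxTokenLength]; // never grows
        int length = 0;
        boolean skipping = false;
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                if (skipping) {
                    continue;             // discard the rest of the long run
                }
                if (length == maxTokenLength) {
                    length = 0;           // buffer full, match still growing:
                    skipping = true;      // empty it and enter skipping mode
                } else {
                    buffer[length++] = (char) c;
                }
            } else {
                if (!skipping && length > 0) {
                    tokens.add(new String(buffer, 0, length));
                }
                length = 0;               // the run ended: back to normal mode,
                skipping = false;         // no token returned for a skipped run
            }
        }
        if (!skipping && length > 0) {
            tokens.add(new String(buffer, 0, length));
        }
        return tokens;
    }

    // Convenience overload for in-memory text.
    static List<String> tokenize(String text, int maxTokenLength) {
        try {
            return tokenize(new StringReader(text), maxTokenLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With maxTokenLength set to 255, the input <300 letters>bob <300 letters>ed yields no tokens at all, while a token of exactly 255 characters is still emitted.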

--
Steve
www.lucidworks.com

