On Tue, Nov 18, 2014 at 1:16 PM, Marvin Humphrey <[email protected]> wrote:
> On Sat, Nov 15, 2014 at 3:22 AM, Michael McCandless
> <[email protected]> wrote:
>
>> The analysis chain (attributes) is overly complex.
>
> If you were to start from scratch, what would the analysis chain look like?
Hi Marvin, long time no talk! I like the new Go bindings for Lucy!

Here are some things that bug me about Lucene's analysis APIs:

* Lucene's attributes separate interface from impl, with default impls, and this causes complex code in oal.util.Attribute*. It seems like overkill: we should just have concrete core impls for the attributes Lucene knows how to index. There are five Java source files in that package related to attributes (Attribute.java, AttributeFactory.java, AttributeImpl.java, AttributeReflector.java, AttributeSource.java): too much. Sketch 1 below shows what one custom attribute costs today.

* There should not be a global AttributeFactory that owns all attributes throughout the pipeline: that's too global. Rather, each stage should be free to control what the next stage sees (LUCENE-2450) ... the namespace would be private to that stage, and each stage could delete/add/replace the incoming bindings it saw. This may seem more complex, but I think it'd be simpler in the end? Sketch 2 is a strawman of the idea.

* The first stage should not have to be responsible for clearing things that later stages inserted: forgetting to call clearAttributes in that first Tokenizer is a common source of bugs. Sketch 3 shows where the footgun sits.

* Reuse of token streams was an "afterthought" that took a long time to work its way down to simpler APIs, but now we have ReuseStrategy, AnalyzerWrapper, DelegatingAnalyzerWrapper.

* Custom analyzers can't be (easily?) serialized, so ES and Solr have their own layers to parse a custom chain from JSON/XML. Those layers could do better error checking... Sketch 4 shows why the chain is opaque to them.

* Can we do something better with offsets, such that TokenFilters (not just Tokenizers/CharFilters) would also be able to set correct offsets?

* The stuffing of things into "analysis" that really should have been a "gentle schema" is annoying: KeywordAnalyzer, Numeric*.

* Token filters that want to create graphs are nearly impossible. E.g. you cannot put WordDelimiterFilter in front of SynonymFilter today because SynonymFilter can't handle an incoming graph (LUCENE-5012). Sketch 5 dumps the attributes involved.

* Deleted tokens should still be present, just "marked" as deleted (so IndexWriter doesn't index them). This would make it possible (to Rob's horror) for tokenizers to preserve every single character they saw, with the non-tokens (punctuation, whitespace) marked deleted. Maybe this makes it possible for all stages to work with offsets properly?

There is probably more, and probably lots of people disagree that these are even "problems" :)

A few sketches, to make some of this concrete:
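Sketch 1: roughly what one custom attribute costs under the current design. The DeletedAttribute name is hypothetical (it doubles as the "mark tokens deleted" idea from the last bullet above), but the interface-plus-Impl split and the abstract clear()/copyTo() are the real Lucene 4.x contract:

    import org.apache.lucene.util.Attribute;
    import org.apache.lucene.util.AttributeImpl;

    // File 1: the interface, which is all that analysis code refers to.
    public interface DeletedAttribute extends Attribute {
      boolean deleted();
      void setDeleted(boolean deleted);
    }

    // File 2: the impl, found by naming convention (FooAttribute ->
    // FooAttributeImpl) via the default AttributeFactory's reflection.
    public class DeletedAttributeImpl extends AttributeImpl
        implements DeletedAttribute {
      private boolean deleted;

      public boolean deleted() { return deleted; }
      public void setDeleted(boolean deleted) { this.deleted = deleted; }

      @Override
      public void clear() { deleted = false; }

      @Override
      public void copyTo(AttributeImpl target) {
        ((DeletedAttribute) target).setDeleted(deleted);
      }
    }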
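Sketch 2: a purely hypothetical shape for per-stage namespaces; nothing like this Stage interface exists in Lucene. The point is only that bindings flow stage to stage, each stage deciding what the next one sees, instead of everything living in one shared AttributeSource:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;

    // Hypothetical: a stage receives the bindings the previous stage
    // exported and returns the bindings the next stage will see.  It may
    // add, replace, or drop entries; whatever it drops can't leak
    // downstream, and the first stage never has to clear state it
    // didn't create.
    public interface Stage {
      Map<Class<?>, Object> exportTo(Map<Class<?>, Object> incoming);
    }

    // E.g. a stage that keeps its scratch flags private:
    public class HideFlagsStage implements Stage {
      @Override
      public Map<Class<?>, Object> exportTo(Map<Class<?>, Object> incoming) {
        Map<Class<?>, Object> outgoing = new HashMap<>(incoming);
        outgoing.remove(FlagsAttribute.class);  // later stages never see it
        return outgoing;
      }
    }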
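Sketch 3: where the clearAttributes footgun sits, in a trivial made-up Tokenizer using Lucene 4.x constructors. Delete the clearAttributes() call and nothing fails loudly; stale attributes set by later filters on the previous token just leak into this one:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Hypothetical tokenizer that emits the first chunk of input as a
    // single token, just to show the required boilerplate.
    public final class SingleTokenTokenizer extends Tokenizer {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
      private boolean done = false;

      public SingleTokenTokenizer(Reader input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (done) return false;
        // The contract makes the *first* stage wipe everything, including
        // attributes that only later stages ever set:
        clearAttributes();
        char[] buf = new char[64];
        int len = input.read(buf);
        if (len <= 0) return false;
        termAtt.copyBuffer(buf, 0, len);
        offsetAtt.setOffset(correctOffset(0), correctOffset(len));
        done = true;
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        done = false;
      }
    }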
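Sketch 4: why the chain is opaque to a JSON/XML layer. A minimal custom Analyzer, assuming the Version-free constructors of Lucene 4.10; the whole chain lives inside a createComponents override, i.e. in code rather than data, so ES and Solr each grow a parallel declarative layer to describe it:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class OpaqueChain {
      // Nothing about this chain can be round-tripped through JSON/XML
      // without a separate layer that re-parses names into constructors.
      public static Analyzer build() {
        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName,
                                                           Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(reader);
            TokenStream sink = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, sink);
          }
        };
      }
    }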
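Sketch 5: the two attributes that encode a token graph. This just dumps (term, posInc, posLen) triples, assuming Lucene 4.10's no-arg StandardAnalyzer. A plain analyzer keeps posLen at 1; a graph producer like SynonymFilter with multi-word synonyms sets posLen > 1 on its output while ignoring it on its input, which is LUCENE-5012 in a nutshell:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

    // posInc says where a token starts relative to the previous one;
    // posLen says how many positions it spans.  Together they describe
    // the lattice -- when consumers actually honor both.
    public class DumpGraph {
      public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("f", new StringReader("wi-fi network"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term + " posInc=" + posInc.getPositionIncrement()
              + " posLen=" + posLen.getPositionLength());
        }
        ts.end();
        ts.close();
        analyzer.close();
      }
    }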
Mike McCandless

http://blog.mikemccandless.com