David Gonzalez created OAK-5692:
-----------------------------------

             Summary: Oak Lucene analyzers docs unclear on viable configurations
                 Key: OAK-5692
                 URL: https://issues.apache.org/jira/browse/OAK-5692
             Project: Jackrabbit Oak
          Issue Type: Documentation
            Reporter: David Gonzalez


The Analyzers section of the Oak Lucene docs [1] would benefit from
clarification.

This ticket combines several analyzer-related questions:

* If no analyzer is specified, what analyzer setup is used? (At the very least 
some tokenizer must be used.)
* The docs mention the "default" analyzer 
([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? 
How are they selected for use? Is the selection configurable?
* How are languages handled? E.g. language-specific stop words, synonyms, char 
mapping, and stemming.
* If 
[oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
 is set, it appears the Standard Tokenizer and the standard Lowercase and Stop 
Filters are used. The Stop filter can be augmented with the appropriately named 
stopwords file.
** Can other charFilters/filters be layered on top of this "named" Analyzer? (It 
seems not.)
* When the Stop Filter is used, it provides the OOTB language-based stop words. 
If a custom stopwords file is provided, that list replaces the OOTB 
language-based list, requiring the developer to provide their own 
language-based stop words. Is this correct? If so, this should be called out, 
with a link to the catalog of OOTB stopword txt files for easy inclusion.
* The Stop filter's words property must be a String, not a String[], and the 
value is a comma-delimited String. It would be good to call this out.
* What are all the CharFilters/Filters available? Is there a concise list with 
their params? (E.g. I think PorterStem might support an ignoreCase param?)
* Synonym Filter syntax is unclear. It seems like there are 2 formats: 
directional (x -> y) and bi-directional (comma delimited). I could only get the 
latter to work.
* Are all the options in the link [2] supported? It's unclear whether there is 
a 1:1 mapping between Oak Lucene's and Solr's capabilities, or whether [2] is a 
loose example of the "types" of supported analyzers.

[1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
[2] 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
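
To make the questions about the Stop filter's words property and the two
synonym formats concrete, here is a minimal, self-contained sketch of how such
values could be interpreted. It is not Oak or Lucene code: the class and method
names are hypothetical, and it assumes the synonym rules follow the Solr
synonyms file format, in which directional rules are written with "=>" rather
than "->". Whether Oak hands the file to Lucene in exactly that format is an
assumption that the docs should confirm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: NOT Oak or Lucene code; all names are hypothetical.
class AnalyzerConfigSketch {

    // Splits a single comma-delimited String value (e.g. "stop1.txt, stop2.txt")
    // into trimmed, non-empty entries.
    static List<String> splitCommaList(String value) {
        List<String> out = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Expands one synonym rule into a map from each input term to its outputs.
    // Assumed Solr-style rule formats:
    //   "a, b => c"  directional: a and b are each mapped to c
    //   "a, b, c"    equivalent: every term maps to the whole group
    static Map<String, List<String>> parseSynonymRule(String rule) {
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        int arrow = rule.indexOf("=>");
        if (arrow >= 0) {
            List<String> inputs = splitCommaList(rule.substring(0, arrow));
            List<String> outputs = splitCommaList(rule.substring(arrow + 2));
            for (String in : inputs) {
                mapping.put(in, outputs);
            }
        } else {
            List<String> group = splitCommaList(rule);
            for (String term : group) {
                mapping.put(term, group);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(splitCommaList("stop1.txt, stop2.txt"));
        System.out.println(parseSynonymRule("tv, television"));
        System.out.println(parseSynonymRule("i-pod, i pod => ipod"));
    }
}
```

If the "=>" assumption holds, a rule written as "x -> y" contains no
recognised arrow and no comma, so a parser like this one would treat it as a
single term rather than a directional mapping. That may be one reason only the
comma-delimited form appeared to work.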



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
