[ https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chetan Mehrotra resolved OAK-5692.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.8
                   1.7.0

Updated docs have been published, so resolving the issue.

> Oak Lucene analyzers docs unclear on viable configurations
> ----------------------------------------------------------
>
>                 Key: OAK-5692
>                 URL: https://issues.apache.org/jira/browse/OAK-5692
>             Project: Jackrabbit Oak
>          Issue Type: Documentation
>            Reporter: David Gonzalez
>            Assignee: Chetan Mehrotra
>             Fix For: 1.7.0, 1.8
>
>
> The Analyzers section of the Oak Lucene docs [1] would benefit from
> clarification.
> (Combining analyzer-based topics into a single ticket.)
> * If no analyzer is specified, what analyzer setup is used? (At a bare
> minimum, _some_ tokenizer must be used.)
> * The docs mention the "default" analyzer
> ([oak:queryIndexDefinition]/analyzers/default).
> ** Can other analyzers be defined?
> ** How are they selected for use?
> ** Is the selection configurable?
> * Is the analyzer used at both index AND query time (unless specified by
> the `type=index|query` property)?
> * What is the naming for multiple analyzer nodes? Are all children of
> analyzers assumed to be an analyzer? Ex. if I want one special configuration
> for index and another for query, could I create:
> {noformat}
> ../myIndex/analyzers/indexAnalyzer@type=index
>     .. define the index-time analyzer ...
> ../myIndex/analyzers/queryAnalyzer@type=query
>     .. define the query-time analyzer ...
> {noformat}
> * How are languages handled? Ex. language-specific stop words, synonyms,
> char mapping, and stemming.
> * If
> [oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
> is set, it appears the Standard Tokenizer and the Standard, Lowercase and
> Stop Filters are used. The Stop filter can be augmented with the well-named
> stopwords file.
> ** Can other charFilters/filters be layered on top of this "named" Analyzer?
> (It seems not.)
> * When the Stop Filter is used, it provides the OOTB language-based stop
> words. If a custom stopwords file is provided, that list replaces the OOTB
> language-based list, requiring the developer to provide their own
> language-based stop words. Is this correct? (This should be called out, with
> a link to the catalog of OOTB stopword txt files for easy inclusion.)
> * The Stop filter's words property must be a String, not a String[], and the
> value is a comma-delimited String. It would be good to call this out.
> * What are all the CharFilters/Filters available? Is there a concise list
> with their params? (Ex. I think the PorterStem filter might support an
> ignoreCase param?)
> * Synonym Filter syntax is unclear. It seems like there are 2 formats:
> directional (x -> y) and bi-directional (comma delimited). I could only get
> the latter to work.
> * Are all the options in the link [2] supported? It's unclear whether there
> is a 1:1 match between Oak Lucene's and Solr's capabilities, or whether [2]
> is a loose example of the "types" of supported analyzers.
> * For something like the PatternReplaceCharFilterFactory [3], how do you
> define multiple pattern mappings? IIUC the charFilter node MUST be named:
> {noformat}.../charFilters/PatternReplace{noformat}
> so you can't have multiple "PatternReplace" named nodes, each with its own
> "@pattern" and "@replace" properties. It seems like there is only support
> for a single object for each Factory type?
>
> Generally this seems like the handiest resource:
> https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters
>
> [1] http://jackrabbit.apache.org/oak/docs/query/lucene.html
> [2] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
> [3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)