[jira] [Updated] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

2017-02-16 Thread David Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Gonzalez updated OAK-5692:

Description: 
The Oak lucene docs [1] > Analyzers section would benefit from clarification:

Combining analyzer-based topics into a single ticket

* If no analyzer is specified, what analyzer setup is used? (At the very least, 
some tokenizer must be used.)
* The docs mention the "default" analyzer 
([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? 
How are they selected for use? Is the selection configurable?
* By default, is the analyzer applied at both index AND query time, unless 
specified otherwise by the "type=index|query" property?
* What is the naming for multiple analyzer nodes? Are all children of analyzers 
assumed to be analyzers? Ex. if I want a special configuration for index and 
another for query, could I create:
{noformat}
../myIndex/analyzers/indexAnalyzer@type=index
.. define the index-time analyzer ...
../myIndex/analyzers/queryAnalyzer@type=query
.. define the query-time analyzer ...
{noformat}
* How are languages handled? Ex. language-specific stop words, synonyms, char 
mapping, and stemming.
* If 
[oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
 is set, it appears the Standard Tokenizer and Standard Lowercase and Stop 
Filters are used. The Stop filter can be augmented with the well-named 
stopwords file.
** Can other charFilters/filters be layered on top of this "named" Analyzer? (It 
seems not.)
* When the Stop Filter is used, it provides the OOTB language-based stop words. 
If a custom stopwords file is provided, that list replaces the OOTB 
language-based list, requiring the developer to provide their own 
language-based stop words. Is this correct? This should be called out, with a 
link to the catalog of OOTB stopword txt files for easy inclusion.
* The Stop filter's words property must be a String, not a String[], and the 
value is a comma-delimited String. It would be good to call this out.
* What are all the CharFilters/Filters available? Is there a concise list with 
their params? (Ex. I think PorterStem might support an ignoreCase param?)
* The Synonym Filter syntax is unclear. It seems like there are 2 formats: 
directional (x -> y) and bi-directional (comma-delimited); I could only get the 
latter to work.
* Are all the options in the link [2] supported? It's unclear if there is a 1:1 
mapping between Oak Lucene's and Solr's capabilities, or if [2] is a loose 
example of the "types" of supported analyzers.

[1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
[2] 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema


> Oak Lucene analyzers docs unclear on viable configurations
> --
>
> Key: OAK-5692
>

[jira] [Updated] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

2017-02-16 Thread David Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Gonzalez updated OAK-5692:

Description: 
The Oak lucene docs [1] > Analyzers section would benefit from clarification:

Combining analyzer-based topics into a single ticket

* If no analyzer is specified, what analyzer setup is used? (At the very least, 
some tokenizer must be used.)
* The docs mention the "default" analyzer 
([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? 
How are they selected for use? Is the selection configurable?
* By default, is the analyzer applied at both index AND query time, unless 
specified otherwise by the `type=index|query` property?
* What is the naming for multiple analyzer nodes? Are all children of analyzers 
assumed to be analyzers? Ex. if I want a special configuration for index and 
another for query, could I create:
{noformat}
../myIndex/analyzers/indexAnalyzer@type=index
.. define the index-time analyzer ...
../myIndex/analyzers/queryAnalyzer@type=query
.. define the query-time analyzer ...
{noformat}
* How are languages handled? Ex. language-specific stop words, synonyms, char 
mapping, and stemming.
* If 
[oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
 is set, it appears the Standard Tokenizer and Standard Lowercase and Stop 
Filters are used. The Stop filter can be augmented with the well-named 
stopwords file.
** Can other charFilters/filters be layered on top of this "named" Analyzer? (It 
seems not.)
* When the Stop Filter is used, it provides the OOTB language-based stop words. 
If a custom stopwords file is provided, that list replaces the OOTB 
language-based list, requiring the developer to provide their own 
language-based stop words. Is this correct? This should be called out, with a 
link to the catalog of OOTB stopword txt files for easy inclusion.
* The Stop filter's words property must be a String, not a String[], and the 
value is a comma-delimited String. It would be good to call this out.
* What are all the CharFilters/Filters available? Is there a concise list with 
their params? (Ex. I think PorterStem might support an ignoreCase param?)
* The Synonym Filter syntax is unclear. It seems like there are 2 formats: 
directional (x -> y) and bi-directional (comma-delimited); I could only get the 
latter to work.
* Are all the options in the link [2] supported? It's unclear if there is a 1:1 
mapping between Oak Lucene's and Solr's capabilities, or if [2] is a loose 
example of the "types" of supported analyzers.

[1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
[2] 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema


[jira] [Updated] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

2017-02-16 Thread David Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Gonzalez updated OAK-5692:

Description: 
The Oak lucene docs [1] > Analyzers section would benefit from clarification:

Combining analyzer-based topics into a single ticket

* If no analyzer is specified, what analyzer setup is used? (At the very least, 
some tokenizer must be used.)
* The docs mention the "default" analyzer 
([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? 
How are they selected for use? Is the selection configurable?
* By default, is the analyzer applied at both index AND query time, unless 
specified otherwise by the `type=index|query` property?
* What is the naming for multiple analyzer nodes? Are all children of analyzers 
assumed to be analyzers? Ex. if I want a special configuration for index and 
another for query, could I create:
{noformat}
../myIndex/analyzers/indexAnalyzer@type=index
.. define the index-time analyzer ...
../myIndex/analyzers/queryAnalyzer@type=query
.. define the query-time analyzer ...
{noformat}
* How are languages handled? Ex. language-specific stop words, synonyms, char 
mapping, and stemming.
* If 
[oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
 is set, it appears the Standard Tokenizer and Standard Lowercase and Stop 
Filters are used. The Stop filter can be augmented with the well-named 
stopwords file.
** Can other charFilters/filters be layered on top of this "named" Analyzer? (It 
seems not.)
* When the Stop Filter is used, it provides the OOTB language-based stop words. 
If a custom stopwords file is provided, that list replaces the OOTB 
language-based list, requiring the developer to provide their own 
language-based stop words. Is this correct? This should be called out, with a 
link to the catalog of OOTB stopword txt files for easy inclusion.
* The Stop filter's words property must be a String, not a String[], and the 
value is a comma-delimited String. It would be good to call this out.
* What are all the CharFilters/Filters available? Is there a concise list with 
their params? (Ex. I think PorterStem might support an ignoreCase param?)
* The Synonym Filter syntax is unclear. It seems like there are 2 formats: 
directional (x -> y) and bi-directional (comma-delimited); I could only get the 
latter to work.
* Are all the options in the link [2] supported? It's unclear if there is a 1:1 
mapping between Oak Lucene's and Solr's capabilities, or if [2] is a loose 
example of the "types" of supported analyzers.
* For something like the PatternReplaceCharFilterFactory [3], how do you 
define multiple pattern mappings? IIUC, the charFilter node MUST be named 
{noformat}.../charFilters/PatternReplace{noformat}, so you can't have multiple 
"PatternReplace" named nodes, each with its own "@pattern" and "@replace" 
properties. It seems like there is only support for a single object for each 
Factory type?
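
The two synonym formats mentioned in the list above can be pictured with a small stand-alone sketch. It assumes the Solr synonyms file conventions, where a comma-delimited line declares a group of equivalent terms and an explicit (directional) mapping is written with `=>` rather than `->`. Whether Oak passes such rules through to Lucene unchanged is not confirmed here. The helper name `parse_synonym_rules` is hypothetical, and the code does not call Lucene or Oak.

```python
def parse_synonym_rules(text):
    """Parse Solr-style synonym rules into a dict: term -> set of output terms.

    Illustrative only: handles the two basic rule forms and '#' comments,
    not the full Solr syntax (no escaping, no expand=false handling).
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "=>" in line:
            # Directional rule: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = {t.strip() for t in right.split(",") if t.strip()}
            for source in sources:
                mapping.setdefault(source, set()).update(targets)
        else:
            # Bi-directional rule: every term expands to the whole group.
            terms = [t.strip() for t in line.split(",") if t.strip()]
            for term in terms:
                mapping.setdefault(term, set()).update(terms)
    return mapping


rules = parse_synonym_rules("""
# bi-directional group
plane, airplane, aircraft
# directional mapping
tv => television
""")
```

Here `rules["plane"]` contains all three terms of the group, while `rules["tv"]` contains only `television` and there is no entry for `television` itself, which is the asymmetry the directional form is meant to express.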


Generally this seems like the handiest resource: 
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters

[1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
[2] 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
[3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories


[jira] [Updated] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

2017-02-16 Thread Chetan Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra updated OAK-5692:
Description: 
The Oak lucene docs [1] > Analyzers section would benefit from clarification:

Combining analyzer-based topics into a single ticket

* If no analyzer is specified, what analyzer setup is used? (At the very least, 
some tokenizer must be used.)
* The docs mention the "default" analyzer 
([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be defined? 
How are they selected for use? Is the selection configurable?
* By default, is the analyzer applied at both index AND query time, unless 
specified otherwise by the `type=index|query` property?
* What is the naming for multiple analyzer nodes? Are all children of analyzers 
assumed to be analyzers? Ex. if I want a special configuration for index and 
another for query, could I create:
{noformat}
../myIndex/analyzers/indexAnalyzer@type=index
.. define the index-time analyzer ...
../myIndex/analyzers/queryAnalyzer@type=query
.. define the query-time analyzer ...
{noformat}
* How are languages handled? Ex. language-specific stop words, synonyms, char 
mapping, and stemming.
* If 
[oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
 is set, it appears the Standard Tokenizer and Standard Lowercase and Stop 
Filters are used. The Stop filter can be augmented with the well-named 
stopwords file.
** Can other charFilters/filters be layered on top of this "named" Analyzer? (It 
seems not.)
* When the Stop Filter is used, it provides the OOTB language-based stop words. 
If a custom stopwords file is provided, that list replaces the OOTB 
language-based list, requiring the developer to provide their own 
language-based stop words. Is this correct? This should be called out, with a 
link to the catalog of OOTB stopword txt files for easy inclusion.
* The Stop filter's words property must be a String, not a String[], and the 
value is a comma-delimited String. It would be good to call this out.
* What are all the CharFilters/Filters available? Is there a concise list with 
their params? (Ex. I think PorterStem might support an ignoreCase param?)
* The Synonym Filter syntax is unclear. It seems like there are 2 formats: 
directional (x -> y) and bi-directional (comma-delimited); I could only get the 
latter to work.
* Are all the options in the link [2] supported? It's unclear if there is a 1:1 
mapping between Oak Lucene's and Solr's capabilities, or if [2] is a loose 
example of the "types" of supported analyzers.
* For something like the PatternReplaceCharFilterFactory [3], how do you 
define multiple pattern mappings? IIUC, the charFilter node MUST be named 
{noformat}.../charFilters/PatternReplace{noformat}, so you can't have multiple 
"PatternReplace" named nodes, each with its own "@pattern" and "@replace" 
properties. It seems like there is only support for a single object for each 
Factory type?
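
The replace-versus-augment question about stop words raised in the list above can be pictured as follows. This is a stand-alone sketch, not Oak or Lucene code: `DEFAULT_STOPWORDS` is a tiny made-up stand-in for a bundled language list, and `load_stopwords` and `remove_stopwords` are hypothetical helpers. It assumes a stop-word file with one word per line and `#` comment lines.

```python
# Made-up stand-in for an out-of-the-box language list (not Lucene's actual list).
DEFAULT_STOPWORDS = {"a", "an", "the", "and", "of"}


def load_stopwords(text):
    """Read one stop word per line, ignoring blank lines and '#' comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip().lower()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]


custom = load_stopwords("# project-specific\noak\njackrabbit\n")
tokens = ["The", "Oak", "index", "of", "Jackrabbit"]

# Custom list REPLACES the defaults: common words such as "The" survive.
replaced = remove_stopwords(tokens, custom)

# Custom list AUGMENTS the defaults: both sets are applied.
augmented = remove_stopwords(tokens, DEFAULT_STOPWORDS | custom)
```

If the behaviour is indeed "replace", a developer who wants the second result would have to ship the language list together with the custom words, which is why a link to the bundled stop-word files would help.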


Generally this seems like the handiest resource: 
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters

[1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
[2] 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
[3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories


[jira] [Updated] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

2017-02-17 Thread David Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Gonzalez updated OAK-5692:

Description: 
The Oak lucene docs [1] > Analyzers section would benefit from clarification:

Combining analyzer-based topics into a single ticket

* If no analyzer is specified, what analyzer setup is used? (At a bare minimum, 
_some_ tokenizer must be used.)
* The docs mention the "default" analyzer 
([oak:queryIndexDefinition]/analyzers/default). 
** Can other analyzers be defined? 
** How are they selected for use? 
** Is the selection configurable?
* Is the analyzer applied at both index AND query time (unless specified 
otherwise by the `type=index|query` property)?
* What is the naming for multiple analyzer nodes? Are all children of analyzers 
assumed to be analyzers? Ex. if I want a special configuration for index and 
another for query, could I create:
{noformat}
../myIndex/analyzers/indexAnalyzer@type=index
.. define the index-time analyzer ...
../myIndex/analyzers/queryAnalyzer@type=query
.. define the query-time analyzer ...
{noformat}
* How are languages handled? Ex. language-specific stop words, synonyms, char 
mapping, and stemming.
* If 
[oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
 is set, it appears the Standard Tokenizer and Standard Lowercase and Stop 
Filters are used. The Stop filter can be augmented with the well-named 
stopwords file.
** Can other charFilters/filters be layered on top of this "named" Analyzer? (It 
seems not.)
* When the Stop Filter is used, it provides the OOTB language-based stop words. 
If a custom stopwords file is provided, that list replaces the OOTB 
language-based list, requiring the developer to provide their own 
language-based stop words. Is this correct? This should be called out, with a 
link to the catalog of OOTB stopword txt files for easy inclusion.
* The Stop filter's words property must be a String, not a String[], and the 
value is a comma-delimited String. It would be good to call this out.
* What are all the CharFilters/Filters available? Is there a concise list with 
their params? (Ex. I think PorterStem might support an ignoreCase param?)
* The Synonym Filter syntax is unclear. It seems like there are 2 formats: 
directional (x -> y) and bi-directional (comma-delimited); I could only get the 
latter to work.
* Are all the options in the link [2] supported? It's unclear if there is a 1:1 
mapping between Oak Lucene's and Solr's capabilities, or if [2] is a loose 
example of the "types" of supported analyzers.
* For something like the PatternReplaceCharFilterFactory [3], how do you 
define multiple pattern mappings? IIUC, the charFilter node MUST be named 
{noformat}.../charFilters/PatternReplace{noformat}, so you can't have multiple 
"PatternReplace" named nodes, each with its own "@pattern" and "@replace" 
properties. It seems like there is only support for a single object for each 
Factory type?
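
To make the PatternReplace request concrete, the sketch below shows the behaviour being asked for: several pattern/replacement pairs applied in order, as separate char filters in a chain would be. It is plain Python `re`, not Oak or Lucene configuration. The patterns and the helper name `apply_char_mappings` are made up for illustration, and Java regex syntax differs from Python's in places.

```python
import re

# Hypothetical pattern/replacement pairs, applied in order like chained char filters.
MAPPINGS = [
    (r"[-_]+", " "),  # treat dashes and underscores as spaces
    (r"\s+", " "),    # collapse runs of whitespace into one space
]


def apply_char_mappings(text, mappings=MAPPINGS):
    """Apply each (pattern, replacement) pair to the text, in list order."""
    for pattern, replacement in mappings:
        text = re.sub(pattern, replacement, text)
    return text


cleaned = apply_char_mappings("oak__lucene - analyzers")
```

With only one node allowed per factory name, a chain like this would need either distinctly named nodes or a single combined pattern, which is the limitation the last bullet describes.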


Generally this seems like the handiest resource: 
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters

[1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
[2] 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
[3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories
