[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798271#action_12798271 ]

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

Thanks, Paul! I've just committed revision 897357.

add PatternReplaceCharFilter
----------------------------

Key: SOLR-1653
URL: https://issues.apache.org/jira/browse/SOLR-1653
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
Fix For: 1.5
Attachments: SOLR-1653.patch, SOLR-1653.patch

Add a new CharFilter that uses a regular expression for the target of replace string in char stream.

Usage:
{code:title=schema.xml}
<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                groupedPattern="([nN][oO]\.)\s*(\d+)"
                replaceGroups="1,2" blockDelimiters=":;"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797601#action_12797601 ]

Paul taylor commented on SOLR-1653:
-----------------------------------

Hi, I'm using this outside Solr, in an analyser, and I think there may be a performance issue because you cannot pass a compiled Pattern. In the reusableTokenStream() method you cannot reset a CharFilter like you can a tokenizer, so it has to recompile the pattern every time, i.e.

    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            streams = new SavedStreams();
            setPreviousTokenStream(streams);
            // First use: the pattern string is compiled inside the new CharFilter.
            streams.tokenStream = new StandardTokenizer(Version.LUCENE_CURRENT,
                    new PatternReplaceCharFilter("(no\\.) ([0-9]+)", "$1$2", reader));
            // The filter chain starts from the tokenizer.
            streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
            streams.filteredTokenStream = new AccentFilter(streams.filteredTokenStream);
            streams.filteredTokenStream = new LowercaseFilter(streams.filteredTokenStream);
        } else {
            // Reuse: a new CharFilter is still needed, so the pattern is compiled again.
            streams.tokenStream.reset(
                    new PatternReplaceCharFilter("(no\\.) ([0-9]+)", "$1$2", reader));
        }
        return streams.filteredTokenStream;
    }
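The cost Paul describes can be illustrated with java.util.regex alone. The following is a minimal sketch, not code from the patch: the class name PrecompiledReplace and the method normalize are hypothetical, and it only shows the general idea of compiling a Pattern once and creating a cheap Matcher per input.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of SOLR-1653: shows compile-once, match-many.
final class PrecompiledReplace {
    // Compiled a single time when the class is loaded. Pattern instances are
    // immutable and safe to share; Matcher instances are created per input.
    private static final Pattern NO_PATTERN = Pattern.compile("(no\\.) ([0-9]+)");

    // Removes the single space between "no." and the digits that follow it.
    static String normalize(String text) {
        return NO_PATTERN.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(normalize("see no. 543 and no. 7"));
    }
}
```

A filter that accepted a Pattern rather than a pattern string could be rebuilt for each new Reader without paying the compilation cost again; whether the committed class offers such a constructor is not stated in this thread.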
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790056#action_12790056 ]

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

Ok. I'll show you some samples ;-)

||INPUT||groupedPattern||replaceGroups||OUTPUT||comment||
|see-ing looking|(\w+)(ing)|1|see-ing look|Remove "ing" from the end of a word.|
|see-ing looking|(\w+)ing|1|see-ing look|Same as above; the second pair of parentheses can be omitted.|
|No.1 NO. no. 543|[nN][oO]\.\s*(\d+)|{#},1|#1 NO. #543|Sample for a literal. Do not forget to set blockDelimiters to something other than a period when you use a period in groupedPattern.|
|abc-1234-5678|(\w+)-(\d+)-(\d+)|3,{-},1,{-},2|5678-abc-1234|Change the order of the groups.|
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790067#action_12790067 ]

Noble Paul commented on SOLR-1653:
----------------------------------

I guess this can be achieved with the matcher#replaceAll() directly

input = see-ing looking
regex = (\w+)(ing)
replaceWith = $1

input = abc=1234=5678
regex = (\w+)=(\d+)=(\d+)
replaceWith = $3=$1=$2
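Noble's two samples can be tried directly with the standard java.util.regex API. This is a small sketch for checking the suggestion; the class name ReplaceAllSamples and its replace helper are made up for illustration and are not part of any patch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: applies Matcher#replaceAll to the samples above.
final class ReplaceAllSamples {
    // Compiles the regex and replaces every match, using $n group references.
    static String replace(String regex, String replaceWith, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        // Drops the trailing "ing" of a word.
        System.out.println(replace("(\\w+)(ing)", "$1", "see-ing looking"));
        // Reorders the three groups.
        System.out.println(replace("(\\w+)=(\\d+)=(\\d+)", "$3=$1=$2", "abc=1234=5678"));
    }
}
```

The first call yields the same output as the first row of Koji's sample table, so for these cases the standard replacement syntax covers what replaceGroups expresses.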
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790127#action_12790127 ]

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

bq. I guess this can be achieved with the matcher#replaceAll() directly

You're right, as long as we don't correct the offsets of the output char stream. To correct them, I need to process one match at a time.
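To make the offset point concrete, here is a rough sketch of processing one match at a time while recording how far output offsets have drifted from input offsets. It is an illustration under assumptions, not the patch's implementation: the names OffsetTrackingReplace, Result, and correctOffset are hypothetical, and the correction table is a simplification that is only exact at and after the end of each replaced span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex replacement that remembers offset corrections.
final class OffsetTrackingReplace {
    // Replaced text plus one {outputOffset, cumulativeDiff} entry per match.
    static final class Result {
        final String output;
        final List<int[]> corrections;

        Result(String output, List<int[]> corrections) {
            this.output = output;
            this.corrections = corrections;
        }
    }

    static Result replace(String regex, String replacement, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        List<int[]> corrections = new ArrayList<>();
        while (m.find()) {
            // Copies the unmatched text before this match, then the replacement.
            m.appendReplacement(out, replacement);
            // m.end() is the input consumed so far, out.length() the output produced.
            corrections.add(new int[] {out.length(), m.end() - out.length()});
        }
        m.appendTail(out);
        return new Result(out.toString(), corrections);
    }

    // Maps an offset in the output back to the corresponding input offset.
    static int correctOffset(Result r, int outputOffset) {
        int diff = 0;
        for (int[] c : r.corrections) {
            if (c[0] > outputOffset) {
                break;
            }
            diff = c[1];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        Result r = replace("(no\\.) ([0-9]+)", "$1$2", "no. 1 no. 22");
        System.out.println(r.output);
        System.out.println(correctOffset(r, 5));
    }
}
```

A single replaceAll call returns only the final string, which is why the per-match loop is needed when the offsets must be kept; the loop still uses the standard $n replacement syntax.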
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790129#action_12790129 ]

Noble Paul commented on SOLR-1653:
----------------------------------

bq. I need to process one match at a time.

I guess regex can process one match at a time. The most important point is that we don't need to educate users on this new syntax (I am still not clear about the syntax), and there is no need to write and maintain any parsing code.
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790565#action_12790565 ]

Noble Paul commented on SOLR-1653:
----------------------------------

In Solr we refer to regular expression strings as 'regex'. If you think 'pattern' is OK, go ahead.
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790572#action_12790572 ]

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

I see that the existing PatternReplaceFilter (not CharFilter) uses pattern. But it uses replacement, not replaceWith. I think I'll use pattern and replacement.
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790577#action_12790577 ]

Shalin Shekhar Mangar commented on SOLR-1653:
---------------------------------------------

bq. If there is no objections, I'll commit later today.

+1

Thanks Koji!
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789957#action_12789957 ]

Koji Sekiguchi commented on SOLR-1653:
--------------------------------------

I'll commit in a few days.
[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter
[ https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790026#action_12790026 ]

Shalin Shekhar Mangar commented on SOLR-1653:
---------------------------------------------

Koji, even after reading through the test, I do not understand how to use it. Are the characters in curly braces written down for non-groups only? What if I want to remove one particular group?

It is always good to write a use case and an example in the issue description itself.