[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730866#action_12730866 ]

Uwe Schindler commented on LUCENE-1696:
---------------------------------------

I already implemented the new API in this filter for LUCENE-1693. The patch will come shortly together with this issue. The old API can be removed: the filter is now final, so next() and nextToken() can be left unimplemented.

Added New Token API impl for ASCIIFoldingFilter
-----------------------------------------------

Key: LUCENE-1696
URL: https://issues.apache.org/jira/browse/LUCENE-1696
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
Fix For: 2.9
Attachments: ASCIIFoldingFilter._newTokenAPI.patch, TestGermanCollation.java

I added an implementation of incrementToken to ASCIIFoldingFilter.java and extended the existing test case for it. I will attach the patch shortly.

Besides this improvement, I would like to start a small discussion about this filter. ASCIIFoldingFilter is meant to be a replacement for ISOLatin1AccentFilter, which is quite nice as it covers a superset of the latter. I have used this filter quite often, but never on an as-is basis. In most cases this filter does the correct thing (it replaces a special char with its ASCII correspondent), but in some cases, such as German umlauts, it does not return the expected result. A German umlaut like 'ä' should translate not to 'a' but to 'ae'. I would like to change this, but I'm not 100% sure that this is what all users of the filter expect. Another way of doing it would be to make it configurable with a flag. This would not affect performance, as we would only check the flag when such an umlaut char is found.

Further, it would be really helpful if the filter could, on demand, inject the original/unmodified token into the token stream at the same position. I think it's a valid use case to index both the modified and the unmodified token. For instance, the German word süd would be folded to sud. In a query q:(süd) the filter would also fold to sud and therefore find sud, which has a totally different meaning.

Folding works quite well, but for special cases we could add those options to make users' lives easier. The latter could be done in a subclass, while the umlaut problem should be fixed in the base class.

simon

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
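The two ideas in the issue description (a flag for German umlaut expansion, and keeping the unmodified token at the same position) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch and not Lucene's ASCIIFoldingFilter: the class `UmlautFoldingSketch`, the type `Tok`, and the methods `fold` and `foldTokens` are hypothetical names, and the generic folding step uses `java.text.Normalizer` as a stand-in for the filter's own mapping table. "Same position" is read here as a position increment of 0 for the second token.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; this is not Lucene's ASCIIFoldingFilter. */
final class UmlautFoldingSketch {

    /** A term plus its position increment (0 means "same position as the previous token"). */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Folds accents; if the flag is set, German umlauts and sharp s are expanded first. */
    static String fold(String term, boolean expandGermanUmlauts) {
        StringBuilder sb = new StringBuilder(term.length() + 2);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String expansion = expandGermanUmlauts ? germanExpansion(c) : null;
            if (expansion != null) {
                sb.append(expansion);
            } else {
                sb.append(c);
            }
        }
        // Assumed stand-in for the real mapping table: decompose, then drop combining marks.
        String decomposed = Normalizer.normalize(sb, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Returns the German two-letter expansion for c, or null if c has none. */
    private static String germanExpansion(char c) {
        switch (c) {
            case '\u00e4': return "ae";
            case '\u00f6': return "oe";
            case '\u00fc': return "ue";
            case '\u00c4': return "Ae";
            case '\u00d6': return "Oe";
            case '\u00dc': return "Ue";
            case '\u00df': return "ss";
            default: return null;
        }
    }

    /** Emits the folded term and, on demand, the original term stacked at the same position. */
    static List<Tok> foldTokens(List<String> terms, boolean expandGermanUmlauts, boolean keepOriginal) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            String folded = fold(term, expandGermanUmlauts);
            if (keepOriginal && !folded.equals(term)) {
                out.add(new Tok(term, 1));   // original token advances the position
                out.add(new Tok(folded, 0)); // folded token shares that position
            } else {
                out.add(new Tok(folded, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(foldTokens(Arrays.asList("s\u00fcd", "nord"), true, true));
    }
}
```

With the flag off, süd folds to sud as described above; with the flag on, it folds to sued, so a query for süd no longer collides with sud. Indexing the original alongside the folded form costs extra terms, which is presumably why the description proposes it as an on-demand option.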
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730867#action_12730867 ]

Mark Miller commented on LUCENE-1696:
-------------------------------------

Heh - hate to sound like a broken record, but: doesn't making this class final break back compat?
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730868#action_12730868 ]

Uwe Schindler commented on LUCENE-1696:
---------------------------------------

No, it is a new class in 2.9 :-) ASCIIFoldingFilter is not in 2.4.1.

Assignee: Uwe Schindler
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730870#action_12730870 ]

Mark Miller commented on LUCENE-1696:
-------------------------------------

Ah, thanks. That's hard to keep track of. It feels like I committed this so long ago that it couldn't possibly be new ;)
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721116#action_12721116 ]

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

I will be around and fix / adjust it if it needs some changes. If I do not react, please send me a ping on this issue. Thanks
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721054#action_12721054 ]

Mark Miller commented on LUCENE-1696:
-------------------------------------

Patch looks good! I'll just hold off till the token API improvement patch is finished, just in case we need to make an adjustment here.
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720173#action_12720173 ]

Robert Muir commented on LUCENE-1696:
-------------------------------------

Simon, I think if you want to handle accents in a language-dependent/correct way, you can use contrib/collation for this purpose. I don't see an alternative; otherwise you will end up with 50-100 sets of language-dependent rules [essentially duplicating the logic collation already knows about].
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720183#action_12720183 ]

Robert Muir commented on LUCENE-1696:
-------------------------------------

I uploaded a test case under LUCENE-1581 showing how this works with contrib/collation.
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720189#action_12720189 ]

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

bq. I don't see an alternative; otherwise you will end up with 50-100 sets of language-dependent rules [essentially duplicating the logic collation already knows about]

I agree that this would end up in a mess. Still, collation is not an option, as I cannot rely on the locale in that use case. I might have to stick with my changes for umlauts at least. :)
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720192#action_12720192 ]

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

Thanks Robert, I did know about collation before and I evaluated it for this use case. I do not know what language / locale my docs are in, so I cannot set the correct one. Never mind. :)
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720193#action_12720193 ]

Robert Muir commented on LUCENE-1696:
-------------------------------------

Simon, actually I think it's documented that you can use the ENGLISH collator and it will behave like the ASCIIFoldingFilter (simply remove all diacritics). You could then apply tailorings like those in the example and get the behavior you want, versus maintaining a custom ASCIIFoldingFilter...
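The behaviour Robert describes can be approximated with the JDK alone. The sketch below is an illustration under assumptions, not the TestGermanCollation attachment and not contrib/collation itself: it uses `java.text.Collator` directly, and the class `PrimaryStrengthCollationSketch` and its method names are hypothetical. At PRIMARY strength an English collator treats accent differences as insignificant, so süd and sud compare as equal. Tailoring the rules (for example, to treat 'ä' as 'ae') is not shown here.

```java
import java.text.Collator;
import java.util.Locale;

/** Hypothetical sketch: accent-insensitive comparison with a JDK collator. */
final class PrimaryStrengthCollationSketch {

    /** English collator that ignores accent (and case) differences. */
    static Collator englishPrimary() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        // Request canonical decomposition so precomposed accented characters are handled.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    /** True if the two terms are indistinguishable at primary strength. */
    static boolean sameAtPrimaryStrength(String a, String b) {
        return englishPrimary().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameAtPrimaryStrength("s\u00fcd", "sud"));
        System.out.println(sameAtPrimaryStrength("sud", "nord"));
    }
}
```

Note that this shows the same collision Simon raises: without a tailoring, süd and sud remain indistinguishable, which is the point at which language-specific rules would have to be added.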
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720197#action_12720197 ]

Simon Willnauer commented on LUCENE-1696:
-----------------------------------------

bq. Simon, actually I think it's documented that you can use the ENGLISH collator and it will behave like the ASCIIFoldingFilter (simply remove all diacritics). You could then apply tailorings like those in the example and get the behavior you want, versus maintaining a custom ASCIIFoldingFilter...

Will try, thanks!
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter
[ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720201#action_12720201 ]

Robert Muir commented on LUCENE-1696:
-------------------------------------

Since this seems to be a recurring theme, maybe a javadoc modification would be useful. Otherwise I imagine you might receive lots of bug reports saying 'ASCIIFoldingFilter does X for Y language incorrectly'. Part of the confusion might be because the docs say it 'converts to their ASCII equivalents', and 'equivalent' means different things to different people in different languages...