[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730866#action_12730866
 ] 

Uwe Schindler commented on LUCENE-1696:
---

I already implemented the new API in this filter for LUCENE-1693. The patch 
will come shortly, together with this issue.

The old API can be removed: the filter is now final, so next() and 
nextToken() can be left unimplemented.

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java


 I added an implementation of incrementToken to ASCIIFoldingFilter.java and 
 extended the existing test case for it.
 I will attach the patch shortly.
 Besides this improvement, I would like to start a small discussion about 
 this filter. ASCIIFoldingFilter is meant to be a replacement for 
 ISOLatin1AccentFilter, which is quite nice as it covers a superset of the 
 latter. I have used this filter quite often, but never on an as-is basis. In 
 most cases this filter does the correct thing (replaces a special char 
 with its ASCII correspondent), but in some cases, such as German umlauts, it 
 does not return the expected result. A German umlaut like 'ä' does not 
 translate to 'a' but rather to 'ae'. I would like to change this, but I'm not 
 100% sure whether that is expected by all users of the filter. Another way of 
 doing it would be to make it configurable with a flag. This would not affect 
 performance, as we only check whether such an umlaut char is found. 
 Further, it would be really helpful if the filter could inject the 
 original/unmodified token with the same position increment into the token 
 stream on demand. I think it's a valid use case to index both the modified and 
 the unmodified token. For instance, the German word süd would be folded to 
 sud. In a query q:(süd) the filter would also fold to sud and therefore 
 find sud, which has a totally different meaning. Folding works quite well, but 
 for special cases we could add those options to make users' lives easier. 
 The latter could be done in a subclass, while the umlaut problem should be 
 fixed in the base class.
 simon 
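
 The two ideas in the description above (an umlaut flag, and keeping the 
 original token at the same position) can be sketched roughly as follows. 
 This is a standalone illustration in plain Java, not the attached patch and 
 not Lucene code: the class UmlautFoldingSketch, its Term holder, the 
 expandUmlauts flag and both method names are hypothetical, and it assumes 
 that "same position" is expressed as a position increment of 0 on the 
 injected original.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Lucene code: umlaut folding with an optional expansion flag. */
final class UmlautFoldingSketch {

    /** A term plus its position increment; 0 means "same position as the previous term". */
    static final class Term {
        final String text;
        final int positionIncrement;

        Term(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Folds German umlauts to ASCII. With expandUmlauts set, a-umlaut becomes "ae"
     * (and so on); without it, the diacritic is simply dropped.
     * All other characters are passed through unchanged in this sketch.
     */
    static String fold(String input, boolean expandUmlauts) {
        StringBuilder out = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00E4': out.append(expandUmlauts ? "ae" : "a"); break; // a-umlaut
                case '\u00F6': out.append(expandUmlauts ? "oe" : "o"); break; // o-umlaut
                case '\u00FC': out.append(expandUmlauts ? "ue" : "u"); break; // u-umlaut
                case '\u00C4': out.append(expandUmlauts ? "Ae" : "A"); break; // A-umlaut
                case '\u00D6': out.append(expandUmlauts ? "Oe" : "O"); break; // O-umlaut
                case '\u00DC': out.append(expandUmlauts ? "Ue" : "U"); break; // U-umlaut
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Returns the folded term first and, only if folding changed the text,
     * the original term at the same position (increment 0).
     */
    static List<Term> foldKeepingOriginal(String token, boolean expandUmlauts) {
        List<Term> terms = new ArrayList<>();
        String folded = fold(token, expandUmlauts);
        terms.add(new Term(folded, 1));
        if (!folded.equals(token)) {
            terms.add(new Term(token, 0));
        }
        return terms;
    }
}
```

 With the flag off, fold drops the diacritic as the description says the 
 filter does today; with it on, the German-style expansion is used. Emitting 
 the original only when folding changed the text avoids indexing the same 
 term twice at one position.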

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730867#action_12730867
 ] 

Mark Miller commented on LUCENE-1696:
-

Heh - hate to sound like a broken record, but: making this class final breaks 
back compat?

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730868#action_12730868
 ] 

Uwe Schindler commented on LUCENE-1696:
---

No, it is a new class in 2.9 :-) ASCIIFoldingFilter is not in 2.4.1

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730870#action_12730870
 ] 

Mark Miller commented on LUCENE-1696:
-

Ah, thanks. That's hard to keep track of. It feels like I committed this so long 
ago that it couldn't possibly be new ;)

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-18 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721116#action_12721116
 ] 

Simon Willnauer commented on LUCENE-1696:
-

I will be around and will fix / adjust it if it needs some changes. If I do not 
react, please send me a ping on this issue. Thanks.

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721054#action_12721054
 ] 

Mark Miller commented on LUCENE-1696:
-

Patch looks good! I'll just hold off till the token api improvement patch is 
finished, just in case we need to make an adjustment here.

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720173#action_12720173
 ] 

Robert Muir commented on LUCENE-1696:
-

Simon, I think if you want to handle accents in a language-dependent/correct 
way, you can use contrib/collation for this purpose.

I don't see an alternative; otherwise you will end up with 50-100 sets of 
language-dependent rules [essentially duplicating the logic collation already 
knows about].

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720183#action_12720183
 ] 

Robert Muir commented on LUCENE-1696:
-

I uploaded a test case under LUCENE-1581 showing how this works with 
contrib/collation.

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720189#action_12720189
 ] 

Simon Willnauer commented on LUCENE-1696:
-

bq. I don't see an alternative; otherwise you will end up with 50-100 sets of 
language-dependent rules [essentially duplicating the logic collation already 
knows about]

I agree that this would end up in a mess. Still, collation is not an option, as 
I cannot rely on the locale in that use case.
I might have to stick with my changes for umlauts at least. :)

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720192#action_12720192
 ] 

Simon Willnauer commented on LUCENE-1696:
-

Thanks Robert,
I did know about collation before and I evaluated it for this use case - I do 
not know what language / locale my docs are in, so I cannot set the correct one. 
Never mind. :)

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720193#action_12720193
 ] 

Robert Muir commented on LUCENE-1696:
-

Simon, actually I think it's documented that you can use the ENGLISH collator 
and it will behave like ASCIIFoldingFilter (simply removing all diacritics).
You could then apply tailorings like those in the example and get the behavior 
you want, versus maintaining a custom ASCIIFoldingFilter...
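
As a rough illustration of the first half of that suggestion, the sketch below 
uses only the JDK's java.text.Collator: an English collator at PRIMARY strength 
treats accented and unaccented forms of a letter as equal, so süd and sud 
compare the same. The class and method names are hypothetical, the canonical 
decomposition setting is an extra precaution added here, and the tailoring 
step (for example making 'ä' sort with 'ae') is deliberately not shown. How 
such a collator is wired into contrib/collation is also outside this sketch.

```java
import java.text.Collator;
import java.util.Locale;

/** Hypothetical sketch: accent-insensitive comparison with a stock JDK collator. */
final class PrimaryStrengthCollationSketch {

    /** Builds an English collator that ignores accent (and case) differences. */
    static Collator accentInsensitiveEnglish() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength: only base-letter differences are significant.
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters (e.g. u-umlaut) before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    /** True if the two strings are equal once accents and case are ignored. */
    static boolean sameAtPrimaryStrength(String a, String b) {
        return accentInsensitiveEnglish().compare(a, b) == 0;
    }
}
```

Calling sameAtPrimaryStrength("s\u00FCd", "sud") is expected to return true, 
which is the "simply remove all diacritics" behavior described above; a 
language-specific tailoring would be layered on top of this.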

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720197#action_12720197
 ] 

Simon Willnauer commented on LUCENE-1696:
-


bq. Simon, actually I think it's documented that you can use the ENGLISH 
collator and it will behave like ASCIIFoldingFilter (simply removing all 
diacritics). You could then apply tailorings like those in the example and get 
the behavior you want, versus maintaining a custom ASCIIFoldingFilter...

Will try, thanks!

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java





[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720201#action_12720201
 ] 

Robert Muir commented on LUCENE-1696:
-

Since this seems to be a recurring theme, maybe a javadoc modification would be 
useful.

Otherwise I imagine you might receive lots of bug reports saying 
'asciifoldingfilter does X for Y language incorrectly'.

Part of the confusion might be because the docs say it 'converts to their ASCII 
equivalents', and 'equivalent' means different things to different people in 
different languages...


 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java

