
[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730866#action_12730866
]

Uwe Schindler commented on LUCENE-1696:
---

I already implemented the new API in this filter for LUCENE-1693. The patch
will come shortly, together with this issue.

The old API can be removed: the filter is now final, so next() and
nextToken() can be left unimplemented.

Added New Token API impl for ASCIIFoldingFilter
---

Key: LUCENE-1696
URL: https://issues.apache.org/jira/browse/LUCENE-1696
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Mark Miller
Fix For: 2.9

Attachments: ASCIIFoldingFilter._newTokenAPI.patch,
TestGermanCollation.java

I added an implementation of incrementToken to ASCIIFoldingFilter.java and
extended the existing test case for it.
I will attach the patch shortly.

Besides this improvement, I would like to start a small discussion about
this filter. ASCIIFoldingFilter is meant to be a replacement for
ISOLatin1AccentFilter, which is quite nice as it covers a superset of the
latter. I have used this filter quite often, but never on an as-is basis. In
most cases this filter does the correct thing (replaces a special char
with its ASCII correspondent), but in some cases, such as German umlauts, it
does not return the expected result. A German umlaut like 'ä' should not
translate to 'a' but rather to 'ae'. I would like to change this, but I'm not
100% sure if that is expected by all users of that filter. Another way of
doing it would be to make it configurable with a flag. This would not affect
performance, as we only check if such an umlaut char is found.

Further, it would be really helpful if that filter could inject the
original/unmodified token with the same position increment into the token
stream on demand. I think it's a valid use case to index both the modified
and the unmodified token. For instance, the German word süd would be folded
to sud. In a query q:(süd) the filter would also fold to sud and therefore
find sud, which has a totally different meaning. Folding works quite well, but
for special cases we could add those options to make users' lives easier.
The latter could be done in a subclass, while the umlaut problem should be
fixed in the base class.
simon
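
To make the two requests above concrete, here is a minimal sketch in plain
Java. It is not the ASCIIFoldingFilter code and does not use the Lucene API:
the class UmlautFoldingSketch, the Tok holder and all method names are
hypothetical, and foldPlain only approximates "strip the diacritics" via
java.text.Normalizer. It also assumes that "the same position" means the
injected original token gets a position increment of 0.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class UmlautFoldingSketch {

    /** Hypothetical token model: a term plus its position increment. */
    public static final class Tok {
        public final String term;
        public final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Language-agnostic folding: decompose, then drop combining marks. */
    public static String foldPlain(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    /** German-style folding: expand umlauts and sharp s, then fold the rest. */
    public static String foldGerman(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00e4': sb.append("ae"); break; // a with diaeresis
                case '\u00f6': sb.append("oe"); break; // o with diaeresis
                case '\u00fc': sb.append("ue"); break; // u with diaeresis
                case '\u00c4': sb.append("Ae"); break;
                case '\u00d6': sb.append("Oe"); break;
                case '\u00dc': sb.append("Ue"); break;
                case '\u00df': sb.append("ss"); break; // sharp s
                default: sb.append(c);
            }
        }
        return foldPlain(sb.toString());
    }

    /**
     * Emits the folded term and, if folding changed anything, the original
     * term at the same position (position increment 0).
     */
    public static List<Tok> foldKeepingOriginal(String term) {
        List<Tok> out = new ArrayList<>();
        String folded = foldGerman(term);
        out.add(new Tok(folded, 1));
        if (!folded.equals(term)) {
            out.add(new Tok(term, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(foldPlain("s\u00fcd"));  // prints "sud"
        System.out.println(foldGerman("s\u00fcd")); // prints "sued"
    }
}
```

With the original kept at the same position, a query for the unfolded term
can still be matched exactly, while the folded form stays searchable.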

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Mark Miller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730867#action_12730867
 ] 

Mark Miller commented on LUCENE-1696:
-

Heh - hate to sound like a broken record, but: making this class final breaks
back compat?


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Uwe Schindler (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730868#action_12730868
 ] 

Uwe Schindler commented on LUCENE-1696:
---

No, it is a new class in 2.9 :-) ASCIIFoldingFilter is not in 2.4.1

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
 TestGermanCollation.java



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-07-14 Thread Mark Miller (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730870#action_12730870
]

Mark Miller commented on LUCENE-1696:
-

Ah, thanks. That's hard to keep track of. It feels like I committed this so long
ago that it couldn't possibly be new ;)


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-18 Thread Simon Willnauer (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721116#action_12721116
]

Simon Willnauer commented on LUCENE-1696:
-

I will be around to fix / adjust it if it needs some changes. If I do not
react, please send me a ping on this issue. Thanks.


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-17 Thread Mark Miller (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721054#action_12721054
]

Mark Miller commented on LUCENE-1696:
-

Patch looks good! I'll just hold off till the token api improvement patch is
finished, just in case we need to make an adjustment here.


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720173#action_12720173
 ] 

Robert Muir commented on LUCENE-1696:
-

Simon, I think if you want to handle accents in a language-dependent/correct
way, you can use contrib/collation for this purpose.

I don't see an alternative; otherwise you will end up with 50-100 sets of
language-dependent rules [essentially duplicating the logic collation already
knows about].

 Added New Token API impl for ASCIIFoldingFilter
 ---

 Key: LUCENE-1696
 URL: https://issues.apache.org/jira/browse/LUCENE-1696
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Simon Willnauer
 Fix For: 2.9

 Attachments: ASCIIFoldingFilter._newTokenAPI.patch



[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720183#action_12720183
]

Robert Muir commented on LUCENE-1696:
-

I uploaded a test case under LUCENE-1581 showing how this works with
contrib/collation.

Added New Token API impl for ASCIIFoldingFilter
---

Attachments: ASCIIFoldingFilter._newTokenAPI.patch


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Simon Willnauer (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720189#action_12720189
 ] 

Simon Willnauer commented on LUCENE-1696:
-

bq. I don't see an alternative; otherwise you will end up with 50-100 sets of
language-dependent rules [essentially duplicating the logic collation already
knows about]

I agree that this would end up in a mess. Still, collation is not an option, as
I cannot rely on the locale in that use case.
I might have to stick with my changes for umlauts at least. :)


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Simon Willnauer (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720192#action_12720192
]

Simon Willnauer commented on LUCENE-1696:
-

Thanks Robert,
I did know about collation before and I validated it for the use case - I do not
know what language / locale my docs are in, so I cannot set the correct one.
Never mind. :)


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720193#action_12720193
]

Robert Muir commented on LUCENE-1696:
-

Simon, actually I think it's documented that you can use the ENGLISH collator
and it will behave like ASCIIFoldingFilter (simply removing all diacritics).
You could then apply tailorings like those in the example and get the behavior
you want, versus maintaining a custom ASCIIFoldingFilter...
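
As a rough illustration of the first half of that suggestion, the sketch below
uses only the JDK's java.text.Collator (not contrib/collation itself) to show
an English collator at PRIMARY strength treating accented and unaccented
letters as equal. The class and method names are hypothetical, and the
German tailoring step mentioned above is deliberately not shown.

```java
import java.text.Collator;
import java.util.Locale;

public class PrimaryStrengthSketch {

    /**
     * Returns true when the two terms are equal under English collation
     * rules at PRIMARY strength, i.e. ignoring accent and case differences.
     */
    public static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength compares base letters only.
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters so accents are handled uniformly.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // s, u with diaeresis, d compared against plain "sud"
        System.out.println(sameBaseLetters("s\u00fcd", "sud"));
        // different base letters
        System.out.println(sameBaseLetters("sud", "sad"));
    }
}
```

This mirrors the plain folding behavior; language-specific expectations such
as 'ü' matching 'ue' would need additional tailored rules on top.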


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Simon Willnauer (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720197#action_12720197
]

Simon Willnauer commented on LUCENE-1696:
-

bq. Simon, actually I think it's documented that you can use the ENGLISH
collator and it will behave like ASCIIFoldingFilter (simply removing all
diacritics). You could then apply tailorings like those in the example and get
the behavior you want, versus maintaining a custom ASCIIFoldingFilter...

Will try, thanks!


[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

2009-06-16 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720201#action_12720201
]

Robert Muir commented on LUCENE-1696:
-

Since this seems to be a recurring theme, maybe a javadoc modification would be
useful.

Otherwise I imagine you might receive lots of bug reports saying
'ASCIIFoldingFilter does X for Y language incorrectly'.

Part of the confusion might be because the docs say it 'converts to their ASCII
equivalents', and 'equivalent' means different things to different people in
different languages...

