[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2014-03-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-4072:


Attachment: LUCENE-4072.patch

Whew, thank you!

I did some minor cleanup: I toned down the tests I had added that were very 
slow (added a multiplier, so they will do more work in Jenkins), added 
testMassiveLigature (just to test the case where normalization increases the 
length), and removed the stuff around reset()... since mark isn't supported, 
the default UOE (UnsupportedOperationException) is the right thing.
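
To illustrate the length-increasing case that testMassiveLigature targets, here is a minimal sketch. It assumes the JDK's java.text.Normalizer as a stand-in for ICU4J's Normalizer2, and U+FB03 as the sample ligature; the input used by the actual test is not shown in this thread, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

class LigatureExpansionDemo {
    /** NFKC via the JDK normalizer (an assumption; the patch itself uses ICU4J). */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Repeats a string n times without relying on String.repeat (Java 11+). */
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FB03 LATIN SMALL LIGATURE FFI is a single char before normalization.
        String input = repeat("\uFB03", 1000);
        String output = nfkc(input);
        System.out.println(input.length() + " chars in, " + output.length() + " chars out");
    }
}
```

Because the output is longer than the input, a filter that reads into a fixed-size buffer cannot assume the normalized text fits in the same space.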

I'll commit shortly

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: 4072.patch, 4072.patch, DebugCode.txt, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip


 I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
 The benefit of doing this in a CharFilter is that the tokenizer can work 
 on normalised text, while offset correction ensures that the fast vector 
 highlighter and other offset-dependent features do not break.
 The implementation is available at the following repository:
 https://github.com/ippeiukai/ICUNormalizer2CharFilter
 Unfortunately, this is my unpaid side project and I cannot spend much time 
 merging my work into Lucene to make an appropriate patch. I'd appreciate it 
 if anyone could give it a go. I'm happy to relicense it under whatever terms 
 meet your needs.
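
The offset correction mentioned in the description can be sketched as follows. This is a simplified, hypothetical stand-in, not Lucene's API and not the code in the patch: it records, for offsets in the normalized output, the cumulative difference that maps them back to offsets in the original input.

```java
import java.util.Map;
import java.util.TreeMap;

class OffsetCorrectionSketch {
    // Key: offset in the normalized output. Value: cumulative difference
    // (input offset minus output offset) that applies from that offset onward.
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    void addCorrection(int outputOffset, int cumulativeDiff) {
        corrections.put(outputOffset, cumulativeDiff);
    }

    /** Maps an offset in the normalized output back to the original input. */
    int correct(int outputOffset) {
        Map.Entry<Integer, Integer> e = corrections.floorEntry(outputOffset);
        return e == null ? outputOffset : outputOffset + e.getValue();
    }

    /** Corrections for the input U+FB03 + "x" (2 chars) normalized to "ffix" (4 chars). */
    static OffsetCorrectionSketch forLigatureExample() {
        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        map.addCorrection(1, -1); // the second 'f' still belongs to input offset 0
        map.addCorrection(2, -2); // 'i' and everything after it is shifted by 2
        return map;
    }
}
```

With these corrections, a token "x" found at output offsets 3 to 4 is reported at input offsets 1 to 2, which is what a highlighter needs in order to mark the right span in the stored text.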



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2014-03-12 Thread David Goldfarb (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Goldfarb updated LUCENE-4072:
---

Attachment: 4072.patch

Attaching a new patch. All tests pass. 

I'm using Normalizer2.isInert to check whether we need to keep reading into 
the input buffer, since it doesn't return false positives, even though it's 
not as fast as .hasBoundaryBefore().

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: 4072.patch, 4072.patch, DebugCode.txt, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2014-01-27 Thread David Goldfarb (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Goldfarb updated LUCENE-4072:
---

Attachment: 4072.patch

Attaching a new patch - testCuriousString still fails. 

You're right about readInputToBuffer. I think we also have to stop only on 
normalization boundaries. I see two options:
* use normalizer.hasBoundaryAfter(tmpBuffer[len-1]) (straightforward), or
* use normalizer.hasBoundaryBefore(tmpBuffer[len-1]) together with mark() and 
reset().

{noformat}
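  // Reads more input, recursing until the last char read ends on a
  // normalization boundary (or the input is exhausted).
  private int readInputToBuffer() throws IOException {
    final int len = input.read(tmpBuffer);
    if (len == -1) {
      inputFinished = true;
      return 0;
    }
    inputBuffer.append(tmpBuffer, 0, len);
    // Stop only if the last char has a boundary after it and is not the
    // first half of a surrogate pair.
    if (len >= 2 && normalizer.hasBoundaryAfter(tmpBuffer[len-1])
        && !Character.isHighSurrogate(tmpBuffer[len-1])) {
      return len;
    }
    return len + readInputToBuffer();
  }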
{noformat}
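
To show why the buffer has to end on a normalization boundary, here is a small sketch. It assumes the JDK's java.text.Normalizer with NFC in place of ICU4J's Normalizer2; the class and method names are hypothetical. It normalizes a string in two independent chunks and compares that with normalizing it whole.

```java
import java.text.Normalizer;

class ChunkBoundaryDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Normalizes two chunks independently and concatenates the results. */
    static String normalizeInChunks(String input, int split) {
        return nfc(input.substring(0, split)) + nfc(input.substring(split));
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT composes under NFC.
        String input = "cafe\u0301";
        String whole = nfc(input);
        // Splitting between 'e' and the accent loses the composition.
        System.out.println(whole.equals(normalizeInChunks(input, 4)));
        // Splitting before 'e' keeps the sequence together.
        System.out.println(whole.equals(normalizeInChunks(input, 3)));
    }
}
```

A split between the base letter and its combining mark leaves five chars instead of four, which is the kind of mismatch a read that stops at an arbitrary buffer position can produce.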

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: 4072.patch, DebugCode.txt, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
 LUCENE-4072.patch, ippeiukai-ICUNormalizer2CharFilter-4752cad.zip





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2013-12-23 Thread David Goldfarb (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Goldfarb updated LUCENE-4072:
---

Attachment: LUCENE-4072.patch

This patch dodges the use of hasBoundaryAfter, and the tests pass.

Note in doTestMode there's a clause that checks if the normalized string has 
length zero. It seems the nfkc_cf-normalized output of some strings is empty. 
Examples I found:
'\uDB40\uDCD9'
'\uDB43\uDF86'
'\uFE04'

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: DebugCode.txt, LUCENE-4072.patch, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2013-12-23 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-4072:


Attachment: LUCENE-4072.patch

Thanks so much for attacking this, David. I think that the 0-length case, 
where the input is all default ignorables and normalization creates an empty 
string, makes sense, because in that case there won't be a single token at all 
(MockTokenizer is not a perfect emulator of KeywordTokenizer here).

I think this patch is close, but when running the test a few hundred times I 
hit a failure (see my added testCuriousString, which fails). I think this one 
is a bug in the logic.

Motivated by this failure, I tried to beef up the tests in general:
* fixed my typo where testNFD wasn't actually testing NFD
* test strings > 20 characters, since this filter has an internal 128-char 
buffer.

The latter seems to expose a lot of bugs, I assume due to the internal 
buffering. I haven't looked into this yet. But it seems there are correctness 
issues for documents > 128 chars (as well as what I believe is a separate bug 
seen by testCuriousString, which I think is some bug in the logic related to 
ignorables).


 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: DebugCode.txt, LUCENE-4072.patch, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2013-10-30 Thread David Goldfarb (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Goldfarb updated LUCENE-4072:
---

Attachment: LUCENE-4072.patch

Indeed, changing the code to iterate over codepoints fixed a majority of the 
test failures.

The random tests still fail sometimes -- I believe there's a bug in 
Normalizer2. I submitted a bug report 
[here|http://bugs.icu-project.org/trac/ticket/10524#propertyform].

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: DebugCode.txt, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2013-10-27 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-4072:


Attachment: LUCENE-4072.patch

I looked over the patch, and added license headers and so on.

I also added some new tests, which currently fail. I think the problem is that 
the current logic iterates characters (e.g. passing charAt(x) to 
hasBoundaryBefore and so on), when it should be passing codepoints to these 
methods.
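
The difference between iterating chars and iterating code points can be sketched with plain JDK calls. The class name is hypothetical and this is not the code in the patch; the point is only that a supplementary character occupies two chars, so charAt(x) hands a lone surrogate to any method that expects a code point.

```java
class CodePointIterationDemo {
    /** Collects code points, stepping by Character.charCount so surrogate pairs stay whole. */
    static int[] codePoints(CharSequence text) {
        int[] result = new int[Character.codePointCount(text, 0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            result[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return result;
    }

    public static void main(String[] args) {
        // One supplementary code point (a surrogate pair) between two ASCII letters.
        String text = "a\uDB40\uDCD9b";
        System.out.println(text.length() + " chars, " + codePoints(text).length + " code points");
        for (int cp : codePoints(text)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Each value produced this way is what should be passed to a per-code-point boundary check, rather than the individual chars.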

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: DebugCode.txt, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip, LUCENE-4072.patch, 
 LUCENE-4072.patch, LUCENE-4072.patch





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2013-10-18 Thread David Goldfarb (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Goldfarb updated LUCENE-4072:
---

Attachment: LUCENE-4072.patch

I'm available to help make this work. I updated [~ippei]'s code to use the 4.0 
API (CharStream, CharReader, and ReusableAnalyzerBase are affected). I updated 
[~rcmuir]'s random input test and it's not failing. I'm not sure whether 
Ippei's last fix worked and this ought to have been closed then; I don't see 
this class in the Lucene library.

Let me know if this helps.

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: DebugCode.txt, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip, LUCENE-4072.patch, 
 LUCENE-4072.patch





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2012-05-26 Thread Ippei UKAI (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ippei UKAI updated LUCENE-4072:
---

Attachment: DebugCode.txt

How I debugged, for reference.

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: DebugCode.txt, LUCENE-4072.patch, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2012-05-25 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-4072:


Attachment: LUCENE-4072.patch

Attached is the filter, turned into a patch.

However, I added an additional random test and it currently fails... I will 
look into this more.

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: LUCENE-4072.patch, 
 ippeiukai-ICUNormalizer2CharFilter-4752cad.zip





[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

2012-05-22 Thread Ippei UKAI (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ippei UKAI updated LUCENE-4072:
---

Attachment: ippeiukai-ICUNormalizer2CharFilter-4752cad.zip

 CharFilter that Unicode-normalizes input
 

 Key: LUCENE-4072
 URL: https://issues.apache.org/jira/browse/LUCENE-4072
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Ippei UKAI
 Attachments: ippeiukai-ICUNormalizer2CharFilter-4752cad.zip

