[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

Whew, thank you! I did some minor cleanup: I toned down the tests I had added that were very slow (added a multiplier, so they will do more work in Jenkins), added testMassiveLigature (just to test the case where normalization increases the length), and removed the stuff around reset(): since mark isn't supported, the default UOE is the right thing. I'll commit shortly.

CharFilter that Unicode-normalizes input
    Key: LUCENE-4072
    URL: https://issues.apache.org/jira/browse/LUCENE-4072
    Project: Lucene - Core
    Issue Type: New Feature
    Components: modules/analysis
    Reporter: Ippei UKAI
    Attachments: 4072.patch, 4072.patch, DebugCode.txt, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, ippeiukai-ICUNormalizer2CharFilter-4752cad.zip

I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J. The benefit of doing this in a CharFilter is that the tokenizer can work on normalised text, while offset correction ensures that the fast vector highlighter and other offset-dependent features do not break. The implementation is available at the following repository: https://github.com/ippeiukai/ICUNormalizer2CharFilter

Unfortunately this is my unpaid side-project and I cannot spend much time merging my work into Lucene to make an appropriate patch. I'd appreciate it if anyone could give it a go. I'm happy to relicense it to whatever meets your needs.

-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
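Editor's note: the case that testMassiveLigature is described as covering, normalization producing more chars than it consumed, can be reproduced without Lucene or ICU4J. The sketch below is only an illustration under stated assumptions: it uses the JDK's java.text.Normalizer with NFKC rather than ICU4J's Normalizer2, and the class and method names are hypothetical, not taken from the patch.

```java
import java.text.Normalizer;

public class LigatureExpansion {

    /** Returns the NFKC form of the input, using the JDK normalizer (not ICU4J). */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB03 LATIN SMALL LIGATURE FFI is a single char in the original text.
        String original = "\uFB03";
        String normalized = nfkc(original);

        // After normalization the tokenizer sees more chars than the original held,
        // so every offset after this point has to be corrected back by the filter.
        System.out.println(original.length() + " -> " + normalized.length());
    }
}
```

Because the normalized text is longer than the input, a tokenizer offset into the normalized text no longer lines up with the original, which is why the filter has to keep an offset-correction map.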
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Goldfarb updated LUCENE-4072:
    Attachment: 4072.patch

Attaching a new patch. All tests pass. I'm using Normalizer2.isInert to check whether we need to keep reading into the input buffer, since it doesn't return false positives, even though it's not as fast as hasBoundaryBefore().
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Goldfarb updated LUCENE-4072:
    Attachment: 4072.patch

Attaching a new patch - testCuriousString still fails. You're right about readInputToBuffer. I think we also have to stop only on normalization boundaries. I see two options: use normalizer.hasBoundaryAfter(tmpBuffer\[len-1\]) (straightforward), or use normalizer.hasBoundaryBefore(tmpBuffer\[len-1\]) together with mark() and reset().

{noformat}
private int readInputToBuffer() throws IOException {
  final int len = input.read(tmpBuffer);
  if (len == -1) {
    inputFinished = true;
    return 0;
  }
  inputBuffer.append(tmpBuffer, 0, len);
  // Stop only on a normalization boundary, and never in the middle of a surrogate pair.
  if (len >= 2 && normalizer.hasBoundaryAfter(tmpBuffer[len-1]) && !Character.isHighSurrogate(tmpBuffer[len-1])) {
    return len;
  } else return len + readInputToBuffer();
}
{noformat}
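Editor's note: the surrogate half of the stopping condition in readInputToBuffer above can be shown on its own. This is a minimal sketch, not the patch's code: the class name, the readChunk helper and its chunkSize parameter are hypothetical, and the normalization-boundary check is left out because it needs ICU4J's Normalizer2.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class SurrogateSafeRead {

    /**
     * Reads up to chunkSize chars into out. If the chunk ends on a high
     * surrogate, reads one more char so the pair is never split.
     * Returns the number of chars appended, or 0 at end of input.
     */
    static int readChunk(Reader in, StringBuilder out, int chunkSize) {
        try {
            char[] tmp = new char[chunkSize];
            int len = in.read(tmp, 0, chunkSize);
            if (len <= 0) {
                return 0; // end of input (or a zero-sized chunk was requested)
            }
            out.append(tmp, 0, len);
            if (Character.isHighSurrogate(tmp[len - 1])) {
                int next = in.read(); // pull in the matching low surrogate, if any
                if (next != -1) {
                    out.append((char) next);
                    len++;
                }
            }
            return len;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 as a surrogate pair, then 'b'.
        Reader in = new StringReader("a\uD83D\uDE00b");
        StringBuilder buffer = new StringBuilder();
        int read;
        while ((read = readChunk(in, buffer, 2)) > 0) {
            System.out.println("read " + read + " chars, buffer length " + buffer.length());
        }
    }
}
```

With a chunk size of 2 the first read would otherwise end on the high surrogate, so the helper extends it by one char; a real filter would additionally keep reading until the normalizer reports a boundary.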
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Goldfarb updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

This patch dodges the use of hasBoundaryAfter, and the tests pass. Note that in doTestMode there's a clause that checks whether the normalized string has length zero. It seems the nfkc_cf-normalized output of some strings is empty. Examples I found: '\uDB40\uDCD9', '\uDB43\uDF86', '\uFE04'.
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

Thanks so much for attacking this, David: I think that 0-length, all-default-ignorables case makes sense (where it creates an empty string), because in that case there won't be a single token at all (MockTokenizer is not a perfect emulator of KeywordTokenizer here).

I think this patch is close, but when running the test a few hundred times I hit a failure (see my added testCuriousString, which fails). I think this one is a bug in the logic. Motivated by this failure, I tried to beef up tests in general:
* fixed my typo where testNFD wasn't actually testing NFD
* test strings > 20 characters, since this filter has an internal 128-char buffer.

The latter seems to expose a lot of bugs, I assume due to the internal buffering. I haven't yet looked into this. But it seems there are correctness issues for documents > 128 chars (as well as what I believe is a separate bug seen by testCuriousString, which I think is some bug in the logic related to ignorables).
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Goldfarb updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

Indeed, changing the code to iterate over codepoints fixed a majority of the test failures. The random tests still fail sometimes -- I believe there's a bug in Normalizer2. I submitted a bug report [here|http://bugs.icu-project.org/trac/ticket/10524#propertyform].
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

I looked over the patch, and added license headers and so on. I also added some new tests, which currently fail. I think the problem is that the current logic iterates over chars (e.g. passing charAt(x) to hasBoundaryBefore and so on), when it should be passing codepoints to these methods.
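Editor's note: the difference between iterating chars and iterating codepoints is easy to see with any JDK method that takes an int code point. The sketch below is an illustration only and assumes nothing about the patch: it uses Character.isLetter as a stand-in for Normalizer2.hasBoundaryBefore, and the class and method names are hypothetical.

```java
public class CodePointIteration {

    /** Counts letters by passing each UTF-16 char, which mishandles surrogate pairs. */
    static int countLettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) { // a lone surrogate is never a letter
                count++;
            }
        }
        return count;
    }

    /** Counts letters by passing whole code points, advancing 1 or 2 chars at a time. */
    static int countLettersByCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                count++;
            }
            i += Character.charCount(cp); // 2 for supplementary code points
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D400 MATHEMATICAL BOLD CAPITAL A (a surrogate pair in UTF-16).
        String s = "a\uD835\uDC00";
        System.out.println("by char: " + countLettersByChar(s));
        System.out.println("by code point: " + countLettersByCodePoint(s));
    }
}
```

The char-based loop hands each surrogate half to the property lookup separately, so the supplementary letter is missed; a boundary check fed with charAt(x) would presumably go wrong in the same way.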
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Goldfarb updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

I'm available to help make this work. I updated [~ippei]'s code to use the 4.0 API (CharStream, CharReader, ReusableAnalyzerBase affected). I updated [~rcmuir]'s random input test and it's not failing. I'm not sure whether Ippei's last fix worked and this ought to have been closed then. I don't see this class in the Lucene library. Let me know if this helps.
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ippei UKAI updated LUCENE-4072:
    Attachment: DebugCode.txt

How I debugged, for reference.
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4072:
    Attachment: LUCENE-4072.patch

Attached is the filter, turned into a patch. However, I added an additional random test and it currently fails... will look into this more.
[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input
[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ippei UKAI updated LUCENE-4072:
    Attachment: ippeiukai-ICUNormalizer2CharFilter-4752cad.zip