[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218975#comment-13218975 ]

Tommaso Teofili commented on LUCENE-3731:

I think we can mark this one as resolved. I'd keep it on trunk only for now, and backport the whole thing to 3.x once SOLR-3013 is resolved and committed to trunk too.

> Create a analysis/uima module for UIMA based tokenizers/analyzers
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch, LUCENE-3731_2.patch, LUCENE-3731_3.patch, LUCENE-3731_4.patch, LUCENE-3731_rsrel.patch, LUCENE-3731_speed.patch, LUCENE-3731_speed.patch, LUCENE-3731_speed.patch
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored out in a separate module (modules/analysis/uima) as they can be used in plain Lucene. Then the solr/contrib/uima will contain only the related factories.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216459#comment-13216459 ]

Tommaso Teofili commented on LUCENE-3731:

The two methods analyzeText() and analyzeInput() are confusingly named, so the first one should be renamed to initializeIterator(): its main purpose is to prepare the FSIterator holding the annotations that are then used inside the incrementToken() method.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214093#comment-13214093 ]

Tommaso Teofili commented on LUCENE-3731:

After some more testing I think the CasPool is only good for scenarios where the pool serves different CASes to different clients (the tokenizers), so it is not really helpful in the current implementation. However, it may be useful if we abstract the operation of obtaining and releasing a CAS outside the BaseTokenizer.

In the meantime I noticed that the AEProviderFactory getAEProvider() methods have a keyPrefix parameter that came from the Solr implementation and was intended to hold the core name. So, at the moment, I think it'd be better to (also) have methods which don't need that parameter for the Lucene uses.
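
To illustrate the keyPrefix point above, here is a minimal sketch of a factory that offers both a prefixed lookup and a prefix-free overload. `ProviderFactorySketch`, `ProviderSketch` and `getProvider` are hypothetical stand-ins, not the real AEProviderFactory API, and the cache-key scheme (prefix, `|`, path) is an assumption made only for this example. It assumes Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an AE provider; it only remembers the key it was cached under.
final class ProviderSketch {
    final String cacheKey;

    ProviderSketch(String cacheKey) {
        this.cacheKey = cacheKey;
    }
}

// Hypothetical factory, not the real AEProviderFactory.
class ProviderFactorySketch {
    private final Map<String, ProviderSketch> providers = new ConcurrentHashMap<>();

    // Solr-style lookup: keyPrefix would carry the core name.
    ProviderSketch getProvider(String keyPrefix, String aePath) {
        String key = (keyPrefix == null ? "" : keyPrefix + "|") + aePath;
        return providers.computeIfAbsent(key, ProviderSketch::new);
    }

    // Plain-Lucene overload: there is no core name, so no prefix is required.
    ProviderSketch getProvider(String aePath) {
        return getProvider(null, aePath);
    }
}
```

With this shape, Lucene callers would only pass the descriptor path, while Solr callers could keep isolating providers per core.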
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209490#comment-13209490 ]

Tommaso Teofili commented on LUCENE-3731:

bq. But the question is: is it safe to use CAS/AE after you call release()/destroy() on them?

No, it isn't, so you're right: those methods should not be called inside the close() method.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209474#comment-13209474 ]

Robert Muir commented on LUCENE-3731:

Right, after you reset(Reader) you set a new reader. But the question is: is it safe to use CAS/AE after you call release()/destroy() on them?

Because close() is called on tokenstreams after each invocation, in other words:

{noformat}
Tokenizer t = new Tokenizer(reader);
... stuff ...
t.close();
t.reset(someOtherReader);
... stuff ...
t.close();
{noformat}

So what does CAS.release() really mean? If it means you should not use the CAS again afterwards, then we cannot have it in TokenStream.close(), and the same goes for AE.destroy().
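
The reuse pattern above can be sketched in a self-contained way. This is only an illustration under stated assumptions: `HeavyResource` and `ReusableTokenizerSketch` are hypothetical stand-ins (not the Lucene Tokenizer or the UIMA CAS/AE classes), and `dispose()` is an assumed hook for the moment an instance is retired for good. The idea is that close() only closes the per-document reader, while the expensive resource survives until dispose().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an expensive, long-lived resource such as a CAS or an AE.
class HeavyResource {
    private boolean released = false;

    void release() {
        released = true;
    }

    String process(String text) {
        if (released) {
            throw new IllegalStateException("resource used after release()");
        }
        return text.trim();
    }
}

// Hypothetical reusable tokenizer; not the Lucene Tokenizer API.
class ReusableTokenizerSketch {
    private final HeavyResource resource = new HeavyResource();
    private Reader input;

    ReusableTokenizerSketch(Reader input) {
        this.input = input;
    }

    // Called once per document: swaps in the next document's reader.
    void reset(Reader next) {
        this.input = next;
    }

    // Reads the whole current document and runs it through the long-lived resource.
    String analyze() {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = input.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return resource.process(sb.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called after every document: closes only the per-document reader.
    void close() {
        try {
            input.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed hook, called once when the instance will never be reused.
    void dispose() {
        resource.release();
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        ReusableTokenizerSketch t = new ReusableTokenizerSketch(new StringReader(" first doc "));
        System.out.println(t.analyze());
        t.close();
        t.reset(new StringReader("second doc"));
        System.out.println(t.analyze()); // still works: close() did not release the resource
        t.close();
        t.dispose();
    }
}
```

Had release() been called from close(), the second analyze() call in this sketch would throw, which is the failure mode the question above is about.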
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209304#comment-13209304 ]

Robert Muir commented on LUCENE-3731:

Is that safe to do in Tokenizer.close()?

I ask because Tokenizer.close() is misleading/confusing: the instance is still reused after this for subsequent documents. In other words, Tokenizer.close() closes resources like the Reader itself... it just happens to be that CAS/AE don't complain about you continuing to use them after they are release()'ed/destroy()'ed :)
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209301#comment-13209301 ]

Tommaso Teofili commented on LUCENE-3731:

Some improvement in performance came from releasing the CAS and AE in the close() call:

{noformat}
@Override
public void close() throws IOException {
  super.close();
  // release UIMA resources
  cas.release();
  ae.destroy();
}
{noformat}

Now investigating the use of CasPool for improving throughput in high-usage scenarios.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209247#comment-13209247 ]

Tommaso Teofili commented on LUCENE-3731:

Right, everything seems OK now. I also tried commenting out the {noformat} {noformat} line in build.xml in order to execute tests in parallel. Multiple parallel test executions, also with -Dtests.multiplier=100, passed flawlessly on Java 6; I will see if that is the case for Java 7 too.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208784#comment-13208784 ]

Robert Muir commented on LUCENE-3731:

Thanks Tommaso: I committed this.

Also a tiny change to the end() methods:

{code}
   public void end() throws IOException {
-    if (offsetAttr.endOffset() < finalOffset)
-      offsetAttr.setOffset(finalOffset, finalOffset);
+    offsetAttr.setOffset(finalOffset, finalOffset);
     super.end();
   }
{code}

Unless there is a bug, we should not need the if... I'm not sure we should be reading the attribute values at this stage, or whether that's even defined, and if endOffset is somehow past the reader's final offset, well, we are already in trouble :)

I ran the tests many times, including with -Dtests.multiplier=100, and there were no issues.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208766#comment-13208766 ]

Tommaso Teofili commented on LUCENE-3731:

bq. OK, if there is no objection I will commit this one.

+1. Later I'll post my progress on the other possible performance improvements I'm testing.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208761#comment-13208761 ]

Robert Muir commented on LUCENE-3731:

OK, if there is no objection I will commit this one. I think it will fix the jenkins fails... of course sometimes it takes a few days of jenkins chewing on it to be sure
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208753#comment-13208753 ]

Tommaso Teofili commented on LUCENE-3731:

Thanks Robert for taking care of this, nice improvement :) I agree on the OverridingParams extending the base one, it was also my intent to do that.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208609#comment-13208609 ]

Robert Muir commented on LUCENE-3731:

Tommaso, I will make another prototype patch trying this approach.

In my opinion the caching done in BasicAEProvider/OverridingParamsAEProvider would still be useful even when each tokenstream is allowed to have a new AE, because we would just cache the description itself (so we e.g. only parse the XML a single time), but return a new AE each time. Then we could remove the synchronized and still avoid a 'heavy' construction for the first-time initialization of a new thread (after that, the tokenstream is reused, so there is no issue).

I'll see how it goes and upload a patch if I can make it look nice.
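
The "cache the description, return a new AE each time" idea can be sketched as follows. This is a minimal illustration, not the code from any of the attached patches: `ParsedDescription`, `EngineSketch` and `DescriptionCachingProvider` are hypothetical stand-ins, the "parsing" is simulated by a counter, and the sketch assumes Java 8 or later for `computeIfAbsent`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a parsed descriptor; parsing is the expensive, shareable step.
final class ParsedDescription {
    final String path;

    ParsedDescription(String path) {
        this.path = path;
    }
}

// Hypothetical stand-in for an analysis engine, assumed unsafe to share between threads.
final class EngineSketch {
    final ParsedDescription description;

    EngineSketch(ParsedDescription description) {
        this.description = description;
    }
}

class DescriptionCachingProvider {
    private final Map<String, ParsedDescription> cache = new ConcurrentHashMap<>();

    // Counts simulated parses so the caching effect is observable.
    final AtomicInteger parseCount = new AtomicInteger();

    // Parses each descriptor path at most once, but builds a fresh engine on every call,
    // so callers never share an engine and no synchronized block is needed here.
    EngineSketch getEngine(String descriptorPath) {
        ParsedDescription description = cache.computeIfAbsent(descriptorPath, path -> {
            parseCount.incrementAndGet();
            return new ParsedDescription(path);
        });
        return new EngineSketch(description);
    }
}
```

Each tokenstream would call getEngine() once at construction time and then keep its own engine for as long as it is reused.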
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208595#comment-13208595 ]

Tommaso Teofili commented on LUCENE-3731:

Hi Robert, reusing the CAS is good. As you note in the patch, we need to take care of how to let each tokenizer instance get its own AE; in the previous Solr version, core names were used to cache and get AEs.

As said on dev@, we may start by letting each tokenizer have its own AE and then improve the design once concurrency is fixed.

I'm doing tests with other types of UIMA flow controllers; right now the WhiteboardFlowController seems to behave slightly better.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208460#comment-13208460 ]

Tommaso Teofili commented on LUCENE-3731:

The fix for the issues reported by Steven was committed in r1244474.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208393#comment-13208393 ]

Tommaso Teofili commented on LUCENE-3731:

OK, I noticed this was due to an issue on the UIMA side. Since those components are used just for testing, I think the best option is to use dummy implementations of both the UIMA-based whitespace tokenizer and the PoS tagger, which also avoids the log lines when executing tests using Maven.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208145#comment-13208145 ]

Tommaso Teofili commented on LUCENE-3731:

Thank you very much Steven for reporting.

The

{noformat}
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer initialize
INFO: "Whitespace tokenizer successfully initialized"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer typeSystemInit
INFO: "Whitespace tokenizer typesystem initialized"
{noformat}

messages are due to the UIMA WhitespaceTokenizer Annotator, which logs the initialization/processing/etc. calls. They are printed out many times because the testRandomStrings test method runs lots of tricky tests on the UIMATokenizer, which require the above calls to be executed repeatedly.

I'll take a look at the other failures, which didn't show up in the tests I had done until now.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208129#comment-13208129 ]

Steven Rowe commented on LUCENE-3731:

Hi Tommaso,

I just committed modifications to the IntelliJ IDEA and Maven configurations.

Something strange is happening, though: one test method consistently fails under both IntelliJ and Maven: {{UIMABaseAnalyzerTest.testRandomStrings()}}. However, under Ant, this always succeeds, including with the seeds that fail under either IntelliJ or Maven.

Also, under both IntelliJ and Maven, the following sequence is printed out literally thousands of times to STDERR (with increasing time stamps) - however, I don't see this at all under Ant:

{noformat}
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer initialize
INFO: "Whitespace tokenizer successfully initialized"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer typeSystemInit
INFO: "Whitespace tokenizer typesystem initialized"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer process
INFO: "Whitespace tokenizer starts processing"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer process
INFO: "Whitespace tokenizer finished processing"
{noformat}

Here are two different example failures, from Maven - they seem to have different causes, which is baffling:

{noformat}
The following exceptions were thrown by threads:
*** Thread: Thread-1 ***
java.lang.RuntimeException: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:289)
Caused by: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
	at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:73)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:333)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
	at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
	at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
	at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
	at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
	at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
	at org.apache.lucene.analysis.uima.BaseUIMATokenizer.analyzeInput(BaseUIMATokenizer.java:57)
	at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.analyzeText(UIMAAnnotationsTokenizer.java:61)
	at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:71)
	... 3 more
Caused by: java.lang.NullPointerException
	at org.apache.uima.impl.UimaContext_ImplBase$ComponentInfoImpl.mapToSofaID(UimaContext_ImplBase.java:655)
	at org.apache.uima.cas.impl.CASImpl.getView(CASImpl.java:2646)
	at org.apache.uima.jcas.impl.JCasImpl.getView(JCasImpl.java:1415)
	at org.apache.uima.examples.tagger.HMMTagger.process(HMMTagger.java:250)
	at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
	at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
	... 12 more
*** Thread: Thread-2 ***
java.lang.AssertionError: token 0 does not exist
	at org.junit.Assert.fail(Assert.java:93)
	at org.junit.Assert.assertTrue(Assert.java:43)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:121)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
NOTE: reproduce with: ant test -Dtestcase=UIMABaseAnalyzerTest -Dt
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208076#comment-13208076 ]

Tommaso Teofili commented on LUCENE-3731:

committed on trunk in r1244236

> Create a analysis/uima module for UIMA based tokenizers/analyzers
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch, LUCENE-3731_2.patch, LUCENE-3731_3.patch, LUCENE-3731_4.patch
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored
> out in a separate module (modules/analysis/uima) as they can be used in plain
> Lucene. Then the solr/contrib/uima will contain only the related factories.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208075#comment-13208075 ]

Robert Muir commented on LUCENE-3731:

Thanks for factoring this out Tommaso.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208058#comment-13208058 ]

Tommaso Teofili commented on LUCENE-3731:

I'm going to commit this one shortly if no one objects.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206997#comment-13206997 ]

Tommaso Teofili commented on LUCENE-3731:

bq. Hi Tommaso, I think it would be cleaner to set the final offset in end() instead?

ok, +1.

> Create a analysis/uima module for UIMA based tokenizers/analyzers
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch, LUCENE-3731_2.patch, LUCENE-3731_3.patch
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored
> out in a separate module (modules/analysis/uima) as they can be used in plain
> Lucene. Then the solr/contrib/uima will contain only the related factories.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206837#comment-13206837 ]

Robert Muir commented on LUCENE-3731:

Hi Tommaso, I think it would be cleaner to set the final offset in end() instead?
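The suggestion about end() can be illustrated with a minimal sketch. It assumes a simplified stand-in class, OffsetSketchTokenizer, which is hypothetical and is not Lucene's Tokenizer or the UIMA tokenizer from the patch. It only mimics the incrementToken()/end() life cycle to show where the final offset would be set: incrementToken() simply returns false when the input is exhausted, and end() records the full input length as the final offset.

```java
/**
 * Hypothetical stand-in for a tokenizer. It mimics the incrementToken()/end()
 * life cycle but is NOT Lucene's Tokenizer class.
 */
final class OffsetSketchTokenizer {
    private final String text;
    private int pos = 0;
    private String term = "";
    private int startOffset = 0;
    private int endOffset = 0;

    OffsetSketchTokenizer(String text) {
        this.text = text;
    }

    /** Advances to the next whitespace-separated token; returns false when exhausted. */
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            return false; // no final-offset bookkeeping here
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        term = text.substring(start, pos);
        startOffset = start;
        endOffset = pos;
        return true;
    }

    /** Called once after the last token: the final offset is the full input length. */
    void end() {
        startOffset = text.length();
        endOffset = text.length();
    }

    String term() {
        return term;
    }

    int startOffset() {
        return startOffset;
    }

    int endOffset() {
        return endOffset;
    }
}
```

With trailing whitespace in the input, the last token's end offset and the final offset differ, which is the case where keeping the final-offset logic in end() makes the separation visible.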
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199893#comment-13199893 ]

Tommaso Teofili commented on LUCENE-3731:

Hey Robert, that's super, thanks!
I'm going to collect your suggestions in a new patch shortly.

> Create a analysis/uima module for UIMA based tokenizers/analyzers
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored
> out in a separate module (modules/analysis/uima) as they can be used in plain
> Lucene. Then the solr/contrib/uima will contain only the related factories.
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199778#comment-13199778 ]

Robert Muir commented on LUCENE-3731:

Thanks for starting this Tommaso: I was unable to apply the patch (were there some svn-copies?)

But I suggest in general using the BaseTokenStreamTestCase.assertTokenStreamContents/assertAnalyzesTo: e.g. instead of:

{code}
// check that 'the big brown fox jumped on the wood' tokens have the expected PoS types
String[] expectedPos = new String[]{"at", "jj", "jj", "nn", "vbd", "in", "at", "nn"};
int i = 0;
while (ts.incrementToken()) {
  assertNotNull(offsetAtt);
  assertNotNull(termAtt);
  assertNotNull(typeAttr);
  assertEquals(typeAttr.type(), expectedPos[i]);
  i++;
}
{code}

you could use:

{code}
assertTokenStreamContents(ts,
    new String[] { "the", "big", "brown", ... }, /* expected terms */
    new String[] { "at", "jj", "jj", ... }, /* expected types */
{code}

There are also variants that let you supply expected start/end offsets, I think that would be good.

Finally, to check for lots of other bugs (including thread-safety, compatibility with charfilters, etc), I would recommend:

{code}
/** blast some random strings through the analyzer */
public void testRandomStrings() throws Exception {
  Analyzer a = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer = new MyTokenizer(reader);
      return new TokenStreamComponents(tokenizer, tokenizer);
    }
  };
  checkRandomData(random, a, 1*RANDOM_MULTIPLIER);
}
{code}

If you look at BaseTokenStreamTestCase you will see all of these methods are insanely nitpicky and find all kinds of bugs in analysis components, so it will really help test coverage I think.
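The idea behind blasting random strings through an analyzer can be sketched without any Lucene dependency. The class below is an illustration only: RandomTokenCheck, its tokenize() stand-in and the invariants it checks are assumptions made for this sketch, not the actual logic of BaseTokenStreamTestCase.checkRandomData. It generates seeded random inputs, tokenizes them, and fails as soon as a token's offsets go backwards, fall outside the input, or do not match the term text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative randomized check; not Lucene's BaseTokenStreamTestCase. */
final class RandomTokenCheck {

    /** One token with its character offsets into the input. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in tokenizer: splits the input on whitespace. */
    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos > start) {
                out.add(new Token(input.substring(start, pos), start, pos));
            }
        }
        return out;
    }

    /** Builds a random string of letters and whitespace, up to maxLength characters. */
    static String randomString(Random random, int maxLength) {
        char[] alphabet = {'a', 'b', 'c', ' ', '\t', '\n'};
        int length = random.nextInt(maxLength + 1);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(alphabet[random.nextInt(alphabet.length)]);
        }
        return sb.toString();
    }

    /**
     * Checks offset invariants on random inputs. The seed makes a failure
     * reproducible. Returns the number of strings checked.
     */
    static int checkRandomStrings(long seed, int iterations) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            String input = randomString(random, 40);
            int lastEnd = 0;
            for (Token t : tokenize(input)) {
                if (t.start < lastEnd || t.end <= t.start || t.end > input.length()) {
                    throw new AssertionError("bad offsets for seed " + seed + ", iteration " + i);
                }
                if (!input.substring(t.start, t.end).equals(t.term)) {
                    throw new AssertionError("term does not match offsets for seed " + seed);
                }
                lastEnd = t.end;
            }
        }
        return iterations;
    }
}
```

Passing the seed explicitly is the design point: a failing run can be repeated with the same seed, in the same spirit as the "reproduce with" line that the Lucene test framework prints.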
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199771#comment-13199771 ]

Tommaso Teofili commented on LUCENE-3731:

right Uwe, thanks so much for the quick review :)
[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199770#comment-13199770 ]

Uwe Schindler commented on LUCENE-3731:

Hi,

{code}
+      clearAttributes();
+      AnnotationFS next = iterator.next();
+      termAttr.setEmpty();
+      termAttr.append(next.getCoveredText());
+      termAttr.setLength(next.getCoveredText().length());
{code}

As you clear the attributes already, the length of termAttr is 0, so setEmpty is not needed. termAttr.setLength() is also not useful, as append will initialize the length already. All you need is termAttr.append(next.getCoveredText());

Uwe
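The redundancy described in this review comment can be shown with a small self-contained sketch. TermBuffer below is a hypothetical stand-in backed by a StringBuilder, not Lucene's CharTermAttribute, and TermFill is an invented helper. The sketch only assumes what the comment states: clearing leaves the term at length 0, and append grows the length by the appended text.

```java
/** Hypothetical stand-in for a term attribute; NOT Lucene's CharTermAttribute. */
final class TermBuffer {
    private final StringBuilder chars = new StringBuilder();

    /** Resets the term text, as clearing the attributes is described to do. */
    void clear() {
        chars.setLength(0);
    }

    TermBuffer setEmpty() {
        chars.setLength(0);
        return this;
    }

    TermBuffer setLength(int length) {
        chars.setLength(length);
        return this;
    }

    /** Appends text; the length grows by the number of appended characters. */
    TermBuffer append(CharSequence text) {
        chars.append(text);
        return this;
    }

    int length() {
        return chars.length();
    }

    @Override
    public String toString() {
        return chars.toString();
    }
}

final class TermFill {
    /** Mirrors the sequence in the patch: clear, setEmpty, append, setLength. */
    static String verbose(TermBuffer term, String coveredText) {
        term.clear();
        term.setEmpty();                      // redundant: already empty after clear()
        term.append(coveredText);
        term.setLength(coveredText.length()); // redundant: append already set the length
        return term.toString();
    }

    /** Mirrors the suggested simplification: clear, then a single append. */
    static String minimal(TermBuffer term, String coveredText) {
        term.clear();
        term.append(coveredText);
        return term.toString();
    }
}
```

Both methods leave the buffer in the same state, even when it held stale text from a previous token, which is why the two extra calls can be dropped.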