[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-29 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218975#comment-13218975
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

I think we can mark this one as resolved; I'd just keep it on trunk only for now
and backport the whole thing to 3.x once SOLR-3013 is resolved and committed to
trunk too.

> Create a analysis/uima module for UIMA based tokenizers/analyzers
> -
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch, LUCENE-3731_2.patch, 
> LUCENE-3731_3.patch, LUCENE-3731_4.patch, LUCENE-3731_rsrel.patch, 
> LUCENE-3731_speed.patch, LUCENE-3731_speed.patch, LUCENE-3731_speed.patch
>
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored 
> out in a separate module (modules/analysis/uima) as they can be used in plain 
> Lucene. Then the solr/contrib/uima will contain only the related factories.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-25 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216459#comment-13216459
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

The two methods analyzeText() and analyzeInput() are confusing, so the first one
should be renamed to initializeIterator(), as its main purpose is to prepare the
FSIterator that holds the annotations used inside the incrementToken() method.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-22 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214093#comment-13214093
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

After some more testing, I think the CasPool is only good for scenarios where
the pool serves different CASes to different clients (the tokenizers), so it is
not really helpful in the current implementation. However, it may be useful if
we abstract the operation of obtaining and releasing a CAS outside the
BaseTokenizer.

In the meantime I noticed that the AEProviderFactory getAEProvider() methods
have a keyPrefix parameter that came from the Solr implementation and was
intended to hold the core name. So, at the moment, I think it'd be better to
(also) have methods which don't need that parameter for the Lucene use cases.
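
The overload idea can be sketched as follows. This is only an illustration:
the real AEProviderFactory signatures are not shown in this thread, so every
name here (SketchProvider, SketchProviderFactory, getProvider, aePath, the
"prefix:path" cache key) is a hypothetical stand-in, not the Lucene or Solr
API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an AE provider; not the real Lucene/UIMA class.
final class SketchProvider {
  final String aePath;

  SketchProvider(String aePath) {
    this.aePath = aePath;
  }
}

// Hypothetical factory showing a prefixed lookup (Solr, keyed by core name)
// next to a prefix-free overload for plain Lucene use.
final class SketchProviderFactory {
  private final Map<String, SketchProvider> cache = new ConcurrentHashMap<>();

  // Solr-style lookup: keyPrefix is meant to hold the core name.
  SketchProvider getProvider(String keyPrefix, String aePath) {
    String key = (keyPrefix == null) ? aePath : keyPrefix + ":" + aePath;
    return cache.computeIfAbsent(key, k -> new SketchProvider(aePath));
  }

  // Lucene-style lookup: there is no core name, so no prefix is required.
  SketchProvider getProvider(String aePath) {
    return getProvider(null, aePath);
  }
}
```

With this shape, Solr callers keep passing the core name while plain Lucene
callers use the one-argument method; both share the same cache.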



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-16 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209490#comment-13209490
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

bq. But the question is: is it safe to use CAS/AE after you call 
release()/destroy() on them?

No, it isn't, so you're right: those methods should not be inside the close()
method.






[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-16 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209474#comment-13209474
 ] 

Robert Muir commented on LUCENE-3731:
-

Right, after you reset(Reader) you set a new reader.

But the question is: is it safe to use CAS/AE after you call 
release()/destroy() on them?

Because close() is called on tokenstreams after each invocation, in other words:
{noformat}
Tokenizer t = new Tokenizer(reader);
... stuff ...
t.close();
t.reset(someOtherReader);
.. stuff ...
t.close();
{noformat}

So what does CAS.release() really mean? If it means you should not use the CAS
again afterwards, then we cannot have it in TokenStream.close(), and the same
goes for AE.destroy().
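
The reuse pattern above can be simulated without Lucene or UIMA. The following
is a minimal sketch under stated assumptions: ReusableTokenizer, EngineHandle
and LifecycleDemo are hypothetical classes, and EngineHandle deliberately
throws when used after release() so that the problem becomes visible (real
CAS/AE objects may not complain at all).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a CAS/AE: once released it must not be used again.
final class EngineHandle {
  private boolean released;

  void release() {
    released = true;
  }

  void process() {
    if (released) {
      throw new IllegalStateException("used after release()");
    }
  }
}

// Hypothetical tokenizer-like class that is reused across documents.
final class ReusableTokenizer {
  private final EngineHandle engine = new EngineHandle();
  private final boolean releaseOnClose;
  private Reader input;

  ReusableTokenizer(Reader input, boolean releaseOnClose) {
    this.input = input;
    this.releaseOnClose = releaseOnClose;
  }

  void reset(Reader next) {
    this.input = next;
  }

  void consume() {
    engine.process();
  }

  // close() runs after every document, not once at the end of the instance's life.
  void close() throws IOException {
    input.close();
    if (releaseOnClose) {
      engine.release();
    }
  }
}

final class LifecycleDemo {
  // Returns true if the second document can still be processed after close().
  static boolean survivesReuse(boolean releaseOnClose) {
    try {
      ReusableTokenizer t =
          new ReusableTokenizer(new StringReader("doc one"), releaseOnClose);
      t.consume();
      t.close();
      t.reset(new StringReader("doc two"));
      boolean ok;
      try {
        t.consume();
        ok = true;
      } catch (IllegalStateException e) {
        ok = false;
      }
      t.close();
      return ok;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, releasing only the per-document Reader in close() keeps the
instance reusable, while releasing the engine there breaks the second
document.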




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-16 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209304#comment-13209304
 ] 

Robert Muir commented on LUCENE-3731:
-

Is that safe to do in Tokenizer.close()?

Because Tokenizer.close() is misleading/confusing: the instance is still reused
after this for subsequent documents. In other words, Tokenizer.close() closes
resources like the Reader itself; it just happens that CAS/AE don't complain
about you continuing to use them after they are release()'ed/destroy()'ed :)




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-16 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209301#comment-13209301
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Some performance improvement came from releasing the CAS and AE in the close()
call:

{noformat}
  @Override
  public void close() throws IOException {
    super.close();
    // release UIMA resources
    cas.release();
    ae.destroy();
  }
{noformat}

I'm now investigating the use of CasPool to improve throughput in high-usage
scenarios.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-16 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209247#comment-13209247
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Right, everything seems ok now.
I also tried commenting out the
{noformat}

{noformat}
line in build.xml in order to execute the tests in parallel.
Multiple parallel test runs on Java 6, including with -Dtests.multiplier=100,
passed flawlessly; I'll see whether that is the case for Java 7 too.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208784#comment-13208784
 ] 

Robert Muir commented on LUCENE-3731:
-

Thanks Tommaso: I committed this.

Also a tiny change to end() methods:
{code}
   public void end() throws IOException {
-    if (offsetAttr.endOffset() < finalOffset)
-      offsetAttr.setOffset(finalOffset, finalOffset);
+    offsetAttr.setOffset(finalOffset, finalOffset);
     super.end();
   }
{code}

Unless there is a bug, we should not need the if...
I'm not sure we should be reading the attribute values at this
stage, or whether that is even defined; and if endOffset is somehow
past the reader's final offset, well, we are already in trouble :)

I ran the tests many times and with -Dtests.multiplier=100 and there
were no issues.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208766#comment-13208766
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

bq. OK, if there is no objection I will commit this one.

+1. I'll post my progress later on the other possible performance improvements
I'm testing.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208761#comment-13208761
 ] 

Robert Muir commented on LUCENE-3731:
-

OK, if there is no objection I will commit this one. 

I think it will fix the Jenkins failures... of course sometimes it takes
a few days of Jenkins chewing on it to be sure.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208753#comment-13208753
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Thanks Robert for taking care of this, nice improvement :)
I agree on the OverridingParams extending the base one; it was also my intent
to do that.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208609#comment-13208609
 ] 

Robert Muir commented on LUCENE-3731:
-

Tommaso, I will make another prototype patch trying this approach.

In my opinion the caching done in BasicAEProvider/OverridingParamsAEProvider
would still be useful even if we allow each tokenstream to have a new AE,
because we would just cache the description itself (so we e.g. only parse the
XML a single time), but return a new AE each time. Then we could remove the
synchronized and still avoid a 'heavy' construction on first-time
initialization for a new thread (after that, the tokenstream is reused, so
there is no issue).
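
The approach described above (cache the parsed description, hand out a fresh
engine per caller) can be sketched roughly as follows. This is not the
BasicAEProvider code: SketchDescription, SketchEngine, SketchEngineProvider
and parseCount are hypothetical names, and the parse() method merely stands in
for the XML parsing step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical parsed descriptor: expensive to build, safe to share.
final class SketchDescription {
  final String path;

  SketchDescription(String path) {
    this.path = path;
  }
}

// Hypothetical engine: built from a shared description, never shared itself.
final class SketchEngine {
  final SketchDescription description;

  SketchEngine(SketchDescription description) {
    this.description = description;
  }
}

final class SketchEngineProvider {
  private final Map<String, SketchDescription> descriptions =
      new ConcurrentHashMap<>();

  // Counts how often the (simulated) parsing step actually runs.
  final AtomicInteger parseCount = new AtomicInteger();

  // Stand-in for parsing the descriptor XML.
  private SketchDescription parse(String path) {
    parseCount.incrementAndGet();
    return new SketchDescription(path);
  }

  // No synchronized method needed: the concurrent map guards the cache,
  // and every caller receives its own engine instance.
  SketchEngine newEngine(String path) {
    SketchDescription description = descriptions.computeIfAbsent(path, this::parse);
    return new SketchEngine(description);
  }
}
```

Two calls to newEngine() with the same path return distinct engines that share
one description, so the parsing cost is paid once per descriptor.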

I'll see how it goes and upload a patch if I can make it look nice.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208595#comment-13208595
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Hi Robert,
reusing the CAS is good. As you note in the patch, we need to take care of how
to let each tokenizer instance get its own AE; in the previous Solr version,
core names were used to cache and get AEs.
As said on dev@, we may start by letting each tokenizer have its own AE and
then improve the design once concurrency is fixed.
I'm doing tests with other types of UIMA flow controllers; right now the
WhiteboardFlowController seems to behave slightly better.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208460#comment-13208460
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

fix for the issues reported by Steven committed in r1244474



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-15 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208393#comment-13208393
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Ok, I noticed this was due to an issue on the UIMA side.
I think the best option (as those are used just for testing) is to use dummy
implementations of both the UIMA-based whitespace tokenizer and the PoS tagger,
thus also avoiding the log lines when executing tests using Maven.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-14 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208145#comment-13208145
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Thank you very much Steven for reporting.

The 
{noformat}
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer initialize
INFO: "Whitespace tokenizer successfully initialized"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer typeSystemInit
INFO: "Whitespace tokenizer typesystem initialized"
{noformat}

messages are due to UIMA WhitespaceTokenizer Annotator which logs the 
initialization/processing/etc. calls.
That is printed out many times because the testRandomStrings test method just 
does lots of tricky tests on the UIMATokenizer which require the above calls to 
be executed repeatedly.

I'll take a look at the other failures, which didn't show up in the tests I had
done so far.



[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-14 Thread Steven Rowe (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208129#comment-13208129
 ] 

Steven Rowe commented on LUCENE-3731:
-

Hi Tommaso,

I just committed modifications to the IntelliJ IDEA and Maven configurations.

Something strange is happening, though: one test method consistently fails 
under both IntelliJ and Maven: {{UIMABaseAnalyzerTest.testRandomStrings()}}.  
However, under Ant, this always succeeds, including with the seeds that fail 
under either IntelliJ or Maven.  Also, under both IntelliJ and Maven, the 
following sequence is printed out literally thousands of times to STDERR (with 
increasing time stamps) - however, I don't see this at all under Ant:

{noformat}
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer initialize
INFO: "Whitespace tokenizer successfully initialized"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer typeSystemInit
INFO: "Whitespace tokenizer typesystem initialized"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer process
INFO: "Whitespace tokenizer starts processing"
Feb 14, 2012 6:34:18 PM WhitespaceTokenizer process
INFO: "Whitespace tokenizer finished processing"
{noformat}

Here are two different example failures, from Maven - they seem to have 
different causes, which is baffling:

{noformat}
The following exceptions were thrown by threads:
*** Thread: Thread-1 ***
java.lang.RuntimeException: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:289)
Caused by: java.io.IOException: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
    at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:73)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:333)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:391)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
    at org.apache.lucene.analysis.uima.BaseUIMATokenizer.analyzeInput(BaseUIMATokenizer.java:57)
    at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.analyzeText(UIMAAnnotationsTokenizer.java:61)
    at org.apache.lucene.analysis.uima.UIMAAnnotationsTokenizer.incrementToken(UIMAAnnotationsTokenizer.java:71)
    ... 3 more
Caused by: java.lang.NullPointerException
    at org.apache.uima.impl.UimaContext_ImplBase$ComponentInfoImpl.mapToSofaID(UimaContext_ImplBase.java:655)
    at org.apache.uima.cas.impl.CASImpl.getView(CASImpl.java:2646)
    at org.apache.uima.jcas.impl.JCasImpl.getView(JCasImpl.java:1415)
    at org.apache.uima.examples.tagger.HMMTagger.process(HMMTagger.java:250)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
    ... 12 more
*** Thread: Thread-2 ***
java.lang.AssertionError: token 0 does not exist
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.assertTrue(Assert.java:43)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:121)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:371)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:295)
    at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:287)
NOTE: reproduce with: ant test -Dtestcase=UIMABaseAnalyzerTest 
-Dt

[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-14 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208076#comment-13208076
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Committed to trunk in r1244236.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-14 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208075#comment-13208075
 ] 

Robert Muir commented on LUCENE-3731:
-

Thanks for factoring this out, Tommaso.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-14 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208058#comment-13208058
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

I'm going to commit this one shortly if no one objects.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-13 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206997#comment-13206997
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

bq. Hi Tommaso, I think it would be cleaner to set the final offset in end() 
instead?

ok, +1.

> Create a analysis/uima module for UIMA based tokenizers/analyzers
> -
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch, LUCENE-3731_2.patch, 
> LUCENE-3731_3.patch
>
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored 
> out in a separate module (modules/analysis/uima) as they can be used in plain 
> Lucene. Then the solr/contrib/uima will contain only the related factories.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-13 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206837#comment-13206837
 ] 

Robert Muir commented on LUCENE-3731:
-

Hi Tommaso, I think it would be cleaner to set the final offset in end() 
instead?
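
A minimal sketch of that suggestion, assuming nothing about the actual patch: the class below is a plain-Java stand-in (it is NOT Lucene's Tokenizer, and EndOffsetSketch and its fields are hypothetical names) that only models the incrementToken()/end() contract. The point is that incrementToken() just reports "no more tokens", and the final offset is set once in end(), which the consumer calls after the stream is exhausted.

```java
/**
 * Plain-Java stand-in for a tokenizer (hypothetical, not the Lucene API).
 * It models only the incrementToken()/end() contract discussed above.
 */
final class EndOffsetSketch {
  private final String input;
  private int pos = 0;

  // Stand-ins for the term and offset attributes of a real token stream.
  String term;
  int startOffset;
  int endOffset;

  EndOffsetSketch(String input) {
    this.input = input;
  }

  /** Emits the next space-separated token, or returns false when exhausted. */
  boolean incrementToken() {
    while (pos < input.length() && input.charAt(pos) == ' ') {
      pos++; // skip separators
    }
    if (pos == input.length()) {
      return false; // no final-offset bookkeeping here
    }
    int start = pos;
    while (pos < input.length() && input.charAt(pos) != ' ') {
      pos++;
    }
    term = input.substring(start, pos);
    startOffset = start;
    endOffset = pos;
    return true;
  }

  /** Called once after incrementToken() has returned false. */
  void end() {
    // The final offset covers the whole input, including trailing separators.
    startOffset = input.length();
    endOffset = input.length();
  }

  public static void main(String[] args) {
    EndOffsetSketch ts = new EndOffsetSketch("big fox  ");
    while (ts.incrementToken()) {
      System.out.println(ts.term + " [" + ts.startOffset + "," + ts.endOffset + ")");
    }
    ts.end();
    System.out.println("final offset: " + ts.endOffset);
  }
}
```

With trailing whitespace in the input, the last token's end offset and the final offset differ, which is the case that setting it in end() handles cleanly.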





[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-03 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199893#comment-13199893
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Hey Robert, that's super, thanks! I'm going to collect your suggestions in a 
new patch shortly.


> Create a analysis/uima module for UIMA based tokenizers/analyzers
> -
>
> Key: LUCENE-3731
> URL: https://issues.apache.org/jira/browse/LUCENE-3731
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3731.patch
>
>
> As discussed in SOLR-3013 the UIMA Tokenizers/Analyzer should be refactored 
> out in a separate module (modules/analysis/uima) as they can be used in plain 
> Lucene. Then the solr/contrib/uima will contain only the related factories.




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-03 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199778#comment-13199778
 ] 

Robert Muir commented on LUCENE-3731:
-

Thanks for starting this Tommaso:

I was unable to apply the patch (were there some svn-copies?)

But in general I suggest using 
BaseTokenStreamTestCase.assertTokenStreamContents/assertAnalyzesTo, 
e.g. instead of:
{code}
   // check that 'the big brown fox jumped on the wood' tokens have the expected PoS types
   String[] expectedPos = new String[]{"at", "jj", "jj", "nn", "vbd", "in", "at", "nn"};
   int i = 0;
   while (ts.incrementToken()) {
     assertNotNull(offsetAtt);
     assertNotNull(termAtt);
     assertNotNull(typeAttr);
     assertEquals(typeAttr.type(), expectedPos[i]);
     i++;
   }
{code}

you could use:
{code}
   assertTokenStreamContents(ts, 
     new String[] { "the", "big", "brown", ... }, /* expected terms */
     new String[] { "at", "jj", "jj", ... });     /* expected types */
{code}

There are also variants that let you supply expected start/end offsets; I think 
using those would be good.
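
For illustration only, the expected offsets for the sample sentence can be derived mechanically. The sketch below is plain Java with no Lucene dependency; OffsetSketch and whitespaceOffsets are hypothetical names, and it assumes tokens are separated by spaces, which may not match what the UIMA tokenizer actually emits. The two arrays it produces are the kind of values one would pass as expected start and end offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives expected start/end offsets for space-separated tokens. */
final class OffsetSketch {

  /** Returns {starts, ends}; end offsets are exclusive (one past the last character). */
  static int[][] whitespaceOffsets(String text) {
    List<Integer> starts = new ArrayList<>();
    List<Integer> ends = new ArrayList<>();
    int pos = 0;
    while (pos < text.length()) {
      // skip separators
      while (pos < text.length() && text.charAt(pos) == ' ') {
        pos++;
      }
      if (pos == text.length()) {
        break;
      }
      int start = pos;
      // consume one token
      while (pos < text.length() && text.charAt(pos) != ' ') {
        pos++;
      }
      starts.add(start);
      ends.add(pos);
    }
    int[][] out = new int[2][starts.size()];
    for (int i = 0; i < starts.size(); i++) {
      out[0][i] = starts.get(i);
      out[1][i] = ends.get(i);
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] offsets = whitespaceOffsets("the big brown fox jumped on the wood");
    System.out.println("starts: " + Arrays.toString(offsets[0]));
    System.out.println("ends:   " + Arrays.toString(offsets[1]));
  }
}
```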

Finally, to check for lots of other bugs (including thread-safety, 
compatibility with charfilters, etc),
I would recommend:
{code}
  /** blast some random strings through the analyzer */
  public void testRandomStrings() throws Exception {
    Analyzer a = new Analyzer() {

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new MyTokenizer(reader);
        return new TokenStreamComponents(tokenizer, tokenizer);
      }
    };
    checkRandomData(random, a, 1*RANDOM_MULTIPLIER);
  }
{code}

If you look at BaseTokenStreamTestCase you will see all of these methods are 
insanely nitpicky
and find all kinds of bugs in analysis components, so it will really help test 
coverage I think.





[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-03 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199771#comment-13199771
 ] 

Tommaso Teofili commented on LUCENE-3731:
-

Right, Uwe, thanks so much for the quick review :)




[jira] [Commented] (LUCENE-3731) Create a analysis/uima module for UIMA based tokenizers/analyzers

2012-02-03 Thread Uwe Schindler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199770#comment-13199770
 ] 

Uwe Schindler commented on LUCENE-3731:
---

Hi,

{code}
+  clearAttributes();
+  AnnotationFS next = iterator.next();
+  termAttr.setEmpty();
+  termAttr.append(next.getCoveredText());
+  termAttr.setLength(next.getCoveredText().length());
{code}

As you clear the attributes already, the length of termAttr is 0, so setEmpty 
is not needed. termAttr.setLength() is also not useful, as append will 
initialize the length already. All you need is 
termAttr.append(next.getCoveredText());

Uwe
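
The equivalence described above can be checked with a small stand-in. This is a sketch under stated assumptions: TermBufferSketch is a hypothetical class backed by a StringBuilder, not Lucene's CharTermAttribute, and it only mirrors the behavior the comment relies on (clearing resets the length to 0, append grows the length).

```java
/**
 * Hypothetical stand-in for a term attribute, backed by a StringBuilder.
 * Not the Lucene API; it only mirrors the clear/setEmpty/append/setLength behavior.
 */
final class TermBufferSketch {
  private final StringBuilder buf = new StringBuilder();

  /** Models what clearing the attributes does to the term: length becomes 0. */
  void clear() {
    buf.setLength(0);
  }

  TermBufferSketch setEmpty() {
    buf.setLength(0);
    return this;
  }

  TermBufferSketch append(CharSequence s) {
    buf.append(s); // the length grows with the appended text
    return this;
  }

  TermBufferSketch setLength(int length) {
    buf.setLength(length);
    return this;
  }

  int length() {
    return buf.length();
  }

  @Override
  public String toString() {
    return buf.toString();
  }

  public static void main(String[] args) {
    // Verbose sequence, as in the quoted patch lines.
    TermBufferSketch verbose = new TermBufferSketch();
    verbose.clear();
    verbose.setEmpty().append("brown").setLength("brown".length());

    // Reduced sequence: clear, then append only.
    TermBufferSketch minimal = new TermBufferSketch();
    minimal.clear();
    minimal.append("brown");

    System.out.println(verbose + " / " + minimal);
  }
}
```

Both sequences leave the buffer with the same text and length, so the extra calls add nothing once the attributes have been cleared.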
