[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844916#action_12844916
 ] 

Simon Willnauer commented on LUCENE-2314:
-

Small comment on javadoc wording. 

Maybe like that:
{code}
/**
 * Copies the contents of this AttributeSource to the given AttributeSource.
 * The given instance has to provide all {@link Attribute}s this instance contains.
 * The actual attribute implementations must be identical in both {@link AttributeSource} instances.
 * Ideally both AttributeSource instances should use the same {@link AttributeFactory}.
 */
{code}
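The contract in that javadoc can be sketched with a tiny standalone model. This is plain Java collections, not Lucene's actual AttributeSource; the class and field names here are illustrative only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal standalone sketch (not Lucene's implementation) of the copyTo
// contract described above: the target must already contain every attribute
// the source holds, and values are copied over in place.
public class AttributeCopyDemo {
    final Map<String, String> attrs = new LinkedHashMap<>();

    void copyTo(AttributeCopyDemo target) {
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            // Mirrors the javadoc: the target has to provide all attributes
            // this instance contains, otherwise copying is an error.
            if (!target.attrs.containsKey(e.getKey())) {
                throw new IllegalArgumentException("missing attribute: " + e.getKey());
            }
            target.attrs.put(e.getKey(), e.getValue());
        }
    }
}
```

The key point the javadoc makes is that copyTo never adds attributes to the target; it only overwrites ones that already exist there.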




 Add AttributeSource.copyTo(AttributeSource)
 ---

 Key: LUCENE-2314
 URL: https://issues.apache.org/jira/browse/LUCENE-2314
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2314.patch, LUCENE-2314.patch


 One problem with AttributeSource at the moment is the missing insight into 
 AttributeSource.State. If you want to create TokenStreams that inspect 
 captured states, you have no chance. Making the contents of State public is a 
 bad idea, as it does not help with inspecting (it's a linked list, so you have 
 to iterate).
 AttributeSource currently contains a cloneAttributes() call, which returns a 
 new AttributeSource with all current attributes cloned. This is the (more 
 expensive) captureState. The problem is that you cannot copy back the cloned 
 AS (which is the restoreState). To use this behaviour (by the way, 
 ShingleMatrix can use it), one can alternatively use cloneAttributes and 
 copyTo. You can easily change the cloned attributes and store them in lists 
 and copy them back. The only problem is the lower performance of these calls 
 (as State is a very optimized class).
 One use case could be:
 {code}
 AttributeSource state = cloneAttributes();
 //  do something ...
 state.getAttribute(TermAttribute.class).setTermBuffer("foobar");
 // ... more work
 state.copyTo(this);
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844927#action_12844927
 ] 

Simon Willnauer commented on LUCENE-2314:
-

looks good to me!

 Add AttributeSource.copyTo(AttributeSource)
 ---

 Key: LUCENE-2314
 URL: https://issues.apache.org/jira/browse/LUCENE-2314
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2314.patch, LUCENE-2314.patch, LUCENE-2314.patch






[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844420#action_12844420
 ] 

Simon Willnauer commented on LUCENE-2309:
-

The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. 
Neither of them needs to know about Analyzer or TokenStream at all, nor needs 
the analyzer to iterate over tokens. The IndexWriter should instead implement 
an interface, or use a class, that is called for each successful 
incrementToken(), no matter how that increment is implemented.

I could imagine a really simple interface like
{code}

interface AttributeConsumer {
  
  void setAttributeSource(AttributeSource src);

  void next();

  void end();

}
{code}

IW would then pass itself or an instance it uses (DocInverterPerField) to an API 
expecting such a consumer, like:

{code}
field.consume(this);
{code}
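The proposed inversion can be illustrated with a standalone sketch. The names below are hypothetical stand-ins (plain strings instead of an AttributeSource), not the Lucene API; the point is only that the producer drives iteration and calls back into a consumer, so the consumer never touches a TokenStream:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the consumer callback proposed above: the producer
// walks its tokens and notifies the consumer once per token, then once at
// the end. The consumer (standing in for DocInverterPerField) is decoupled
// from how the tokens are produced.
public class ConsumerDemo {
    interface AttributeConsumer {
        void next(String token); // called once per successful increment
        void end();              // called after the last token
    }

    // Stand-in for field.consume(this): the producer owns the loop.
    static void consume(Iterable<String> tokens, AttributeConsumer consumer) {
        for (String t : tokens) {
            consumer.next(t);
        }
        consumer.end();
    }

    // A consumer that just collects what it is given.
    static class CollectingConsumer implements AttributeConsumer {
        final List<String> seen = new ArrayList<>();
        boolean ended;
        public void next(String token) { seen.add(token); }
        public void end() { ended = true; }
    }
}
```

Any producer, TokenStream-based or completely custom, could feed such a consumer without the indexer knowing the difference.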

or something similar. That way we have no dependency on whatever attribute 
producer is used. The default implementation would of course be based on an 
analyzer / token stream, and alternatives could be exposed via an expert API. 
Even backwards compatibility could be solved that way easily.

bq. Only tests would rely on the analyzers module. I think that's OK? core 
itself would have no dependence.
+1. Test dependencies should not block modularization; it's just a matter of 
configuring the classpath!



 Fully decouple IndexWriter from analyzers
 -

 Key: LUCENE-2309
 URL: https://issues.apache.org/jira/browse/LUCENE-2309
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless

 IndexWriter only needs an AttributeSource to do indexing.
 Yet, today, it interacts with Field instances, holds a private
 analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
 wide variety of cases (not analyzed; analyzed from a Reader or a
 String; pre-analyzed).
 I'd like to have IW only interact with attr sources that already
 arrived with the fields.  This would be a powerful decoupling -- it
 means others are free to make their own attr sources.
 They need not even use any of Lucene's analysis impls; eg they can
 integrate to other things like [OpenPipeline|http://www.openpipeline.org].
 Or make something completely custom.
 LUCENE-2302 is already a big step towards this: it makes IW agnostic
 about which attr is the term, and only requires that it provide a
 BytesRef (for flex).
 Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
 FieldType knows the analyzer to use, then we could simply create a
 getAttrSource() method (say) on it and move all the logic IW has today
 onto there.  (We'd still need existing IW code for back-compat).




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844464#action_12844464
 ] 

Simon Willnauer commented on LUCENE-2309:
-

bq. [Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the 
IRC traffic, and I don't get the chance to participate directly unless I watch 
IRC constantly. We should really move back to JIRA / the dev list for such 
discussions. There is too much going on in IRC.

{quote}
Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...
{quote}

But that binds the indexer to a TokenStream, which is unnecessary IMO. What if I 
want to implement something outside the TokenStream delegator API?



 Fully decouple IndexWriter from analyzers
 -

 Key: LUCENE-2309
 URL: https://issues.apache.org/jira/browse/LUCENE-2309
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844523#action_12844523
 ] 

Simon Willnauer commented on LUCENE-2309:
-

bq. Then people could freely use Lucene to index off a foreign analysis chain...
That is what I was talking about!

{quote}
I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposals here, w/ the 
AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field 
will call back to IW, seem too complicated to me. Users that write 
Analyzers/TokenStreams/AttributeSources should not care how they are 
indexed/stored etc. Forcing them to implement this push logic to IW seems to me 
like real unnecessary overhead and complexity.
{quote}

We can surely hide this implementation completely from Field. I consider this 
similar to Collector, where you pass it explicitly to the search method if 
you want different behavior. Maybe something like an AttributeProducer. I don't 
think adding this to Field makes a lot of sense at all, and it is not worth the 
complexity.

bq. Will the Field also control how stored fields are added? Or only 
AttributeSourced ones?
IMO this is only about inverted fields.

bq. We (IW) control the indexing flow, and not the user.
The user only gets the possibility to exchange the analysis chain, but not the 
control flow. The user can already mess around with stuff in incrementToken(); 
the only thing we change / invert is that the indexer no longer knows about 
TokenStreams. It does not change the control flow though.



 Fully decouple IndexWriter from analyzers
 -

 Key: LUCENE-2309
 URL: https://issues.apache.org/jira/browse/LUCENE-2309
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless





[jira] Commented: (LUCENE-2277) QueryNodeImpl throws ConcurrentModificationException on add(List&lt;QueryNode&gt;)

2010-03-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12842332#action_12842332
 ] 

Simon Willnauer commented on LUCENE-2277:
-

Robert, shouldn't the changes text rather say something about the argument that 
was completely ignored? This was simply a bug caused by ignoring the argument 
and calling a similarly named method instead. This could be a bit picky, but I 
thought I should mention it.

Simon

 QueryNodeImpl throws ConcurrentModificationException on add(List&lt;QueryNode&gt;)
 

 Key: LUCENE-2277
 URL: https://issues.apache.org/jira/browse/LUCENE-2277
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0
 Environment: all
Reporter: Frank Wesemann
Assignee: Robert Muir
Priority: Critical
 Fix For: 3.1

 Attachments: addChildren.patch, LUCENE-2277.patch


 On adding a List of children to a QueryNodeImpl implementation, a 
 ConcurrentModificationException is thrown.
 This is due to the fact that QueryNodeImpl, instead of iterating over the 
 supplied list, iterates over its internal clauses list.
 Patch:
 Index: QueryNodeImpl.java
 ===================================================================
 --- QueryNodeImpl.java	(revision 911642)
 +++ QueryNodeImpl.java	(working copy)
 @@ -74,7 +74,7 @@
          .getLocalizedMessage(QueryParserMessages.NODE_ACTION_NOT_SUPPORTED));
    }
 
 -    for (QueryNode child : getChildren()) {
 +    for (QueryNode child : children) {
        add(child);
    }




[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837752#action_12837752
 ] 

Simon Willnauer commented on LUCENE-2279:
-

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream? 
I would totally agree with that, but I guess we cannot remove this method 
until Lucene 4.0, which will be, hmm, in 2020 :) - just joking

bq.Maybe we should absorb ReusableAnalyzerBase back into Analyzer?
That would be the logical consequence, but the problem with ReusableAnalyzerBase 
is that it will break backwards compat if moved to Analyzer. It assumes both 
#reusableTokenStream and #tokenStream to be final and introduces a new factory 
method. Yet, as an analyzer developer you really want to use the new 
ReusableAnalyzerBase instead of Analyzer in 99% of the cases: it requires you 
to write only half of the code and gives you reusability of the tokenStream.

bq. I think Lucene/Solr/Nutch need to eventually get to this point
Huge +1 from my side. This could also unify the factory pattern Solr uses to 
build token streams. I would stop right here and ask to discuss it on the dev 
list. Thoughts, Mike?



 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

 Passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 This is because for each document, Analyzer.tokenStream() is called, which 
 ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is 
 used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its ctor:
 {code}
 public StopFilter(boolean enablePositionIncrements, TokenStream input,
     Set stopWords, boolean ignoreCase) {
   super(input);
   if (stopWords instanceof CharArraySet) {
     this.stopWords = (CharArraySet) stopWords;
   } else {
     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
     this.stopWords.addAll(stopWords);
   }
   this.enablePositionIncrements = enablePositionIncrements;
   init();
 }
 {code}
 I feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of the StopFilter, as they all result in a copy for each invocation 
 of Analyzer.tokenStream().




[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet

2010-02-23 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837465#action_12837465
 ] 

Simon Willnauer commented on LUCENE-2279:
-

I don't consider this an issue at all. Each analyzer creating StopFilter 
instances uses CharArraySet internally, and if you write your own you should do 
so too. The JavaDoc of StopFilter clearly describes what happens if you use a 
plain Set instead of a CharArraySet.
You should really consider reusableTokenStream AND use a CharArraySet instance. 
Have a look at the current trunk to see how all the analyzers handle 
stopwords. Once 3.1 is out you will also be able to subclass 
ReusableAnalyzerBase, which enables reusableTokenStream on the fly in 99% of 
the cases.

I tend towards closing this issue. Robert?
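The cost being discussed can be modeled standalone. The sketch below is not Lucene code (FastSet and adapt are hypothetical stand-ins for CharArraySet and the StopFilter ctor logic quoted in the issue); it just shows why passing the optimized type avoids a copy per tokenStream() call:

```java
import java.util.HashSet;
import java.util.Set;

// Standalone sketch of the advice above: build the optimized stop set once
// and reuse it, instead of handing a plain Set to a filter that must copy
// it on every call.
public class StopSetReuseDemo {
    static int copies = 0;

    // Stand-in for CharArraySet: a type the filter can use directly.
    static class FastSet extends HashSet<String> {
        FastSet(Set<String> words) { super(words); }
    }

    // Stand-in for the StopFilter ctor logic: cast when possible, copy otherwise.
    static FastSet adapt(Set<String> stopWords) {
        if (stopWords instanceof FastSet) {
            return (FastSet) stopWords; // reused, no copy
        }
        copies++;                       // plain Set: copied on every call
        return new FastSet(stopWords);
    }
}
```

Constructing the FastSet once up front means every subsequent adapt() call is a cheap cast, which is the pathological-performance fix the issue title asks for.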



 eliminate pathological performance on StopFilter when using a Set&lt;String&gt; 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: thushara wijeratna





[jira] Commented: (LUCENE-2255) IndexWriter.getReader() allocates file handles

2010-02-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831107#action_12831107
 ] 

Simon Willnauer commented on LUCENE-2255:
-

I see this coming up multiple times; we should document this properly in the 
javadoc and on the wiki. Jason, aren't you the NRT specialist here? What keeps 
you from attaching a patch for the IW javadoc?

simon

 IndexWriter.getReader() allocates file handles
 --

 Key: LUCENE-2255
 URL: https://issues.apache.org/jira/browse/LUCENE-2255
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
 Environment: Ubuntu 9.10, Java 6
Reporter: Mikkel Kamstrup Erlandsen
 Attachments: LuceneManyCommits.java


 I am not sure if this is a bug or really just me not reading the Javadocs 
 right...
 The IR returned by IW.getReader() leaks file handles if you do not close() 
 it, leading to starvation of the file handles available to the process. If it 
 was clear from the docs that this was a *new* reader and not some reference 
 owned by the writer, then this would probably be ok. But as I read the docs, 
 the reader is internally managed by the IW, which at first led me to 
 believe that I shouldn't close it.
 So perhaps the docs should be amended to clearly state that this is a 
 caller-owns reader that *must* be closed? Attaching a simple app that 
 illustrates the problem.
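The caller-owns contract at issue here can be sketched standalone. This is not Lucene code (Reader and getReader below are hypothetical stand-ins for the IndexReader returned by IndexWriter.getReader()); it shows the pattern the docs should mandate, close what the factory hands you:

```java
// Standalone sketch of the caller-owns contract discussed above: each call
// to the factory returns a NEW handle backed by a resource (a file handle in
// IndexWriter.getReader()'s case), and the caller must close it or it leaks.
public class CallerOwnsDemo {
    static int openHandles = 0;

    static class Reader implements AutoCloseable {
        Reader() { openHandles++; }          // acquires a "file handle"
        public void close() { openHandles--; }
    }

    // Stand-in for IndexWriter.getReader(): caller owns the returned reader.
    static Reader getReader() {
        return new Reader();
    }
}
```

With try-with-resources (or an explicit try/finally close()), the handle count returns to zero; without it, each call leaks one handle until the process starves.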




[jira] Updated: (LUCENE-2253) Lucene 3.0 - Deprecated QueryParser Constructor in Demo Code [new QueryParser( contents, analyzer)]

2010-02-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2253:


Component/s: Examples
   Priority: Trivial  (was: Major)
 Issue Type: Task  (was: Bug)

Changed issue to Task / Trivial.

Thanks for reporting this.

 Lucene 3.0 - Deprecated QueryParser Constructor in Demo Code [new 
 QueryParser( contents, analyzer)]
 -

 Key: LUCENE-2253
 URL: https://issues.apache.org/jira/browse/LUCENE-2253
 Project: Lucene - Java
  Issue Type: Task
  Components: Examples
Affects Versions: 2.9.1, 3.0
Reporter: Lock Levels
Priority: Trivial
   Original Estimate: 1h
  Remaining Estimate: 1h

 Found this issue when following the getting started tutorial with Lucene 3.0. 
 It appears the QueryParser constructor was deprecated.
 The new code in results.jsp should be changed from:
 new QueryParser(contents, analyzer)
 to:
 new QueryParser(Version.LUCENE_CURRENT, contents, analyzer)
 http://www.locklevels.com




[jira] Commented: (LUCENE-2080) Improve the documentation of Version

2010-02-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830760#action_12830760
 ] 

Simon Willnauer commented on LUCENE-2080:
-


I like this extension and I think it is important! Yet, I would use the 
following wording instead:

{quote}Additionally, you may need to re-test your entire application to ensure 
it behaves as expected, as some defaults may have changed and may break 
functionality in your application.{quote}


 Improve the documentation of Version
 

 Key: LUCENE-2080
 URL: https://issues.apache.org/jira/browse/LUCENE-2080
 Project: Lucene - Java
  Issue Type: Task
  Components: Javadocs
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 2.9.2, 3.0, 3.1

 Attachments: LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch


 In my opinion, we should elaborate more on the effects of changing the 
 Version parameter.
 Particularly, changing this value, even if you recompile your code, likely 
 involves reindexing your data.
 I do not think this is adequately clear from the current javadocs.




[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830550#action_12830550
 ] 

Simon Willnauer commented on LUCENE-2248:
-

bq. Simon, if you like you can use it as basis and start with contrib. 
will do...

 Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, 
 when development for 3.2 starts
 -

 Key: LUCENE-2248
 URL: https://issues.apache.org/jira/browse/LUCENE-2248
 Project: Lucene - Java
  Issue Type: Test
  Components: Analysis, contrib/*, contrib/analyzers, 
 contrib/benchmark, contrib/highlighter, contrib/spatial, 
 contrib/spellchecker, contrib/wikipedia, Index, Javadocs, Other, 
 Query/Scoring, QueryParser, Search, Store, Term Vectors
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2248.patch, LUCENE-2248.patch


 A lot of tests for the most-recent functionality in Lucene use 
 Version.LUCENE_CURRENT, which is fine in trunk, as we use the most recent 
 version without hassle and changing this in later versions.
 The problem is, if we copy these tests to backwards branch after 3.1 is out 
 and then start to improve analyzers, we then will have the maintenance hell 
 for backwards tests. And we lose backward compatibility testing for older 
 versions. If we specified a concrete version like LUCENE_31 in our tests, 
 then after moving them to backwards they would work without any changes!
 To not always modify all tests after a new version comes out (e.g. after 
 switching to 3.2 dev), I propose to do the following:
 - declare a static final Version TEST_VERSION = Version.LUCENE_CURRENT (or 
 better) Version.LUCENE_31 in LuceneTestCase(4J).
 - change all tests that use Version.LUCENE_CURRENT using eclipse refactor to 
 use this constant and remove unneeded import statements.
 When we then move the tests to backward we must only change one line, 
 depending on how we define this constant:
 - If in trunk LuceneTestCase it's Version.LUCENE_CURRENT, we just change the 
 backwards branch to use the version number of the released version.
 - If trunk already uses the LUCENE_31 constant (I prefer this), we do not 
 need to change backwards, but instead when switching version numbers we just 
 move trunk forward to the next major version (after added to Version enum).
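The proposal amounts to one pinned constant that all tests reference. A minimal standalone sketch (the enum and class names below mirror the proposal but are not the actual LuceneTestCase):

```java
// Standalone sketch of the proposal above: pin the version used by tests in
// one place, so moving tests to the backwards branch means changing exactly
// one line instead of touching every test that used Version.LUCENE_CURRENT.
public class VersionConstantDemo {
    enum Version { LUCENE_30, LUCENE_31, LUCENE_CURRENT }

    // The single line that changes per branch / per release.
    static final Version TEST_VERSION = Version.LUCENE_31;

    // Tests reference TEST_VERSION instead of Version.LUCENE_CURRENT.
    static Version versionForTest() {
        return TEST_VERSION;
    }
}
```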




[jira] Commented: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones

2010-02-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830195#action_12830195
 ] 

Simon Willnauer commented on LUCENE-2245:
-

According to rmuir this will not interrupt LUCENE-2055. Therefore I will commit 
this in a bit if nobody objects.

 Remaining contrib testcases should use Version based ctors instead of 
 deprecated ones
 -

 Key: LUCENE-2245
 URL: https://issues.apache.org/jira/browse/LUCENE-2245
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2245.patch


 Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / 
 Analyzer etc.




[jira] Commented: (LUCENE-2248) Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, when development for 3.2 starts

2010-02-04 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829569#action_12829569
 ] 

Simon Willnauer commented on LUCENE-2248:
-

Uwe, as I already said while we were discussing this, I would add the version 
to LuceneTestCase (or the equivalent for JUnit 4) and then we can do the tests 
in sub-issues, which avoids those super huge patches.

thoughts?!

 Tests using Version.LUCENE_CURRENT will produce problems in backwards branch, 
 when development for 3.2 starts
 -

 Key: LUCENE-2248
 URL: https://issues.apache.org/jira/browse/LUCENE-2248
 Project: Lucene - Java
  Issue Type: Test
  Components: Analysis, contrib/*, contrib/analyzers, 
 contrib/benchmark, contrib/highlighter, contrib/spatial, 
 contrib/spellchecker, contrib/wikipedia, Index, Javadocs, Other, 
 Query/Scoring, QueryParser, Search, Store, Term Vectors
Reporter: Uwe Schindler
Priority: Minor
 Fix For: 3.1


 A lot of tests for the most-recent functionality in Lucene use 
 Version.LUCENE_CURRENT, which is fine in trunk, as we use the most recent 
 version without the hassle of changing this in later versions.
 The problem is: if we copy these tests to the backwards branch after 3.1 is out 
 and then start to improve analyzers, we will have maintenance hell 
 for the backwards tests. And we lose backward-compatibility testing for older 
 versions. If we instead specify a concrete version like LUCENE_31 in our tests, 
 they must work without any changes after moving to backwards!
 To avoid modifying all tests every time a new version comes out (e.g. after 
 switching to 3.2 dev), I propose the following:
 - Declare a static final Version TEST_VERSION = Version.LUCENE_CURRENT (or, 
 better, Version.LUCENE_31) in LuceneTestCase(4J).
 - Change all tests that use Version.LUCENE_CURRENT (using Eclipse refactoring) to 
 use this constant, and remove the unneeded import statements.
 When we then move the tests to backwards, we must only change one line, 
 depending on how we define this constant:
 - If trunk's LuceneTestCase uses Version.LUCENE_CURRENT, we just change the 
 backwards branch to use the version number of the released version.
 - If trunk already uses the LUCENE_31 constant (I prefer this), we do not 
 need to change backwards; instead, when switching version numbers we just 
 move trunk forward to the next major version (after it is added to the Version enum).
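The proposed constant can be sketched as follows. This is a minimal standalone illustration, not the actual Lucene code: the `Version` enum here is a stand-in for `org.apache.lucene.util.Version`, and the class name is hypothetical.

```java
// Stand-in for org.apache.lucene.util.Version, for illustration only.
enum Version { LUCENE_30, LUCENE_31, LUCENE_CURRENT }

public class Main {
    // The single constant all tests would reference. In trunk this points at
    // the most recent released constant (the preferred option); the backwards
    // branch can then take the tests over without touching each test.
    static final Version TEST_VERSION = Version.LUCENE_31;

    public static void main(String[] args) {
        System.out.println(TEST_VERSION);
    }
}
```

With this in place, moving tests to the backwards branch only requires changing (at most) this one line.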

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

2010-02-04 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829687#action_12829687
 ] 

Simon Willnauer commented on LUCENE-2055:
-

Robert, nice work!
I have one comment on StemmerOverrideFilter:

The ctor should not always copy the given dictionary; if it is created 
with such a map, we should use the given instance. This is similar to StopFilter 
vs. StopAnalyzer.
Maybe a CharArrayMap.castOrCopy(Map<?,String>) would be handy in that case.
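The castOrCopy idea can be sketched like this. This is a hypothetical standalone illustration (the `CharMap` class is a stand-in for CharArrayMap, and `castOrCopy` is the suggested, not yet existing, helper):

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Hypothetical stand-in for CharArrayMap, just to illustrate the pattern.
    static class CharMap extends HashMap<String, String> {}

    // castOrCopy: reuse the instance if it already has the right type,
    // otherwise make a one-time defensive copy.
    static CharMap castOrCopy(Map<String, String> map) {
        if (map instanceof CharMap) {
            return (CharMap) map;      // no copy: caller-provided instance reused
        }
        CharMap copy = new CharMap();
        copy.putAll(map);              // copy once for foreign map types
        return copy;
    }

    public static void main(String[] args) {
        CharMap own = new CharMap();
        System.out.println(castOrCopy(own) == own);           // reused, no copy
        Map<String, String> foreign = new HashMap<>();
        foreign.put("loopt", "loop");
        System.out.println(castOrCopy(foreign).get("loopt")); // copied once
    }
}
```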


One minor thing: the null check in DutchAnalyzer seems to be unnecessary, but 
that's fine anyway.
{code}
   if (stemdict != null && !stemdict.isEmpty())
{code}
DutchAnalyzer also has an unused import 

{code}
import java.util.Arrays;
{code}

Except for those, +1 from my side.


 Fix buggy stemmers and Remove duplicate analysis functionality
 --

 Key: LUCENE-2055
 URL: https://issues.apache.org/jira/browse/LUCENE-2055
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Reporter: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2055.patch, LUCENE-2055.patch, LUCENE-2055.patch, 
 LUCENE-2055.patch


 I would like to remove the stemmers in the following packages and use a 
 SnowballStemFilter in their analyzers instead:
 * analyzers/fr
 * analyzers/nl
 * analyzers/ru
 Below are excerpts from this code where they proudly proclaim that they use the 
 snowball algorithm.
 I think we should delete all of this custom stemming code in favor of the 
 actual snowball package.
 {noformat}
 /**
  * A stemmer for French words. 
  * <p>
  * The algorithm is based on the work of
  * Dr Martin Porter on his snowball project<br>
  * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
  * (French stemming algorithm) for details
  * </p>
  */
 public class FrenchStemmer {
 /**
  * A stemmer for Dutch words. 
  * <p>
  * The algorithm is an implementation of
  * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
  * algorithm in Martin Porter's snowball project.
  * </p>
  */
 public class DutchStemmer {
 /**
  * Russian stemming algorithm implementation (see 
  * http://snowball.sourceforge.net for detailed description).
  */
 class RussianStemmer
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones

2010-02-04 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829794#action_12829794
 ] 

Simon Willnauer commented on LUCENE-2245:
-

I will hold off on this patch until LUCENE-2055 is committed; I don't want to 
interrupt Robert's work with this cleanup here.

 Remaining contrib testcases should use Version based ctors instead of 
 deprecated ones
 -

 Key: LUCENE-2245
 URL: https://issues.apache.org/jira/browse/LUCENE-2245
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2245.patch


 Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / 
 Analyzer etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors

2010-01-31 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806839#action_12806839
 ] 

Simon Willnauer commented on LUCENE-2242:
-

I will commit this in a bit if nobody objects.

 Contrib CharTokenizer classes should be instantiated using their new Version 
 based ctors
 

 Key: LUCENE-2242
 URL: https://issues.apache.org/jira/browse/LUCENE-2242
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2242.patch


 Contrib CharTokenizer classes should be instantiated using their new Version 
 based ctors introduced by LUCENE-2183 and LUCENE-2240

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones

2010-01-31 Thread Simon Willnauer (JIRA)
Remaining contrib testcases should use Version based ctors instead of 
deprecated ones
-

 Key: LUCENE-2245
 URL: https://issues.apache.org/jira/browse/LUCENE-2245
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1


Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / 
Analyzer etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones

2010-01-31 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2245:


Attachment: LUCENE-2245.patch

this patch fixes the remaining testcases in contrib.

 Remaining contrib testcases should use Version based ctors instead of 
 deprecated ones
 -

 Key: LUCENE-2245
 URL: https://issues.apache.org/jira/browse/LUCENE-2245
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2245.patch


 Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / 
 Analyzer etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2245) Remaining contrib testcases should use Version based ctors instead of deprecated ones

2010-01-31 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2245:
---

Assignee: Simon Willnauer

 Remaining contrib testcases should use Version based ctors instead of 
 deprecated ones
 -

 Key: LUCENE-2245
 URL: https://issues.apache.org/jira/browse/LUCENE-2245
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2245.patch


 Many testcases in contrib use deprecated ctors for WhitespaceTokenizer / 
 Analyzer etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors

2010-01-31 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2242.
-

Resolution: Fixed

Committed revision 905065.


 Contrib CharTokenizer classes should be instantiated using their new Version 
 based ctors
 

 Key: LUCENE-2242
 URL: https://issues.apache.org/jira/browse/LUCENE-2242
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2242.patch


 Contrib CharTokenizer classes should be instantiated using their new Version 
 based ctors introduced by LUCENE-2183 and LUCENE-2240

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2240) SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors

2010-01-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806658#action_12806658
 ] 

Simon Willnauer commented on LUCENE-2240:
-

bq. Patch looks good, I will commit this with LUCENE-2241 in a day or two. 
Cool, I will go on with LUCENE-2242 and the rest of contrib once this is committed.

 SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors
 ---

 Key: LUCENE-2240
 URL: https://issues.apache.org/jira/browse/LUCENE-2240
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2240.patch


 Due to the changes to CharTokenizer (LUCENE-2183), WhitespaceAnalyzer and 
 SimpleAnalyzer need a Version ctor. The default ctors must be deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2243) FastVectorHighlighter: support DisjunctionMaxQuery

2010-01-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806659#action_12806659
 ] 

Simon Willnauer commented on LUCENE-2243:
-

Koji, could you use a foreach loop instead of the iterator? Just my $0.02:
{code}
DisjunctionMaxQuery dmq = (DisjunctionMaxQuery)sourceQuery;
for (Query query : dmq) {
  flatten(query, flatQueries);
}
{code}

simon

 FastVectorHighlighter: support DisjunctionMaxQuery
 --

 Key: LUCENE-2243
 URL: https://issues.apache.org/jira/browse/LUCENE-2243
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Affects Versions: 2.9, 2.9.1, 3.0
Reporter: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2243.patch


 Add DisjunctionMaxQuery support in FVH. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2238) deprecate ChineseAnalyzer

2010-01-29 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2238.
-

Resolution: Fixed

Committed in revision 904521.

Thanks, Robert.

 deprecate ChineseAnalyzer
 -

 Key: LUCENE-2238
 URL: https://issues.apache.org/jira/browse/LUCENE-2238
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2238.patch


 The ChineseAnalyzer, ChineseTokenizer, and ChineseFilter (not the smart one, 
 or CJK) index Chinese text as individual characters and remove English 
 stopwords, etc.
 In my opinion we should simply deprecate all of this in favor of 
 StandardAnalyzer, StandardTokenizer, and StopFilter, which do the same 
 thing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt

2010-01-29 Thread Simon Willnauer (JIRA)
Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
--

 Key: LUCENE-2239
 URL: https://issues.apache.org/jira/browse/LUCENE-2239
 Project: Lucene - Java
  Issue Type: Task
Reporter: Simon Willnauer


I created this issue as a spin off from 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e

We should decide what to do with NIOFSDirectory, whether we want to keep it as 
the default on non-Windows platforms, and how we want to document this.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt

2010-01-29 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2239:


  Component/s: Store
Affects Version/s: 2.4
   2.4.1
   2.9
   2.9.1
   3.0

 Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
 --

 Key: LUCENE-2239
 URL: https://issues.apache.org/jira/browse/LUCENE-2239
 Project: Lucene - Java
  Issue Type: Task
  Components: Store
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer

 I created this issue as a spin off from 
 http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e
 We should decide what to do with NIOFSDirectory, whether we want to keep it as 
 the default on non-Windows platforms, and how we want to document this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2239) Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt

2010-01-29 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2239:


Attachment: LUCENE-2239.patch

This patch adds documentation to NIOFSDirectory and provides a test case 
triggering the behavior. This might be a little out of date now, but I thought 
I'd add it for completeness.

 Revise NIOFSDirectory and its usage due to NIO limitations on Thread.interrupt
 --

 Key: LUCENE-2239
 URL: https://issues.apache.org/jira/browse/LUCENE-2239
 Project: Lucene - Java
  Issue Type: Task
  Components: Store
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer
 Attachments: LUCENE-2239.patch


 I created this issue as a spin off from 
 http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201001.mbox/%3cf18c9dde1001280051w4af2bc50u1cfd55f85e509...@mail.gmail.com%3e
 We should decide what to do with NIOFSDirectory, whether we want to keep it as 
 the default on non-Windows platforms, and how we want to document this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2240) SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors

2010-01-29 Thread Simon Willnauer (JIRA)
SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors
---

 Key: LUCENE-2240
 URL: https://issues.apache.org/jira/browse/LUCENE-2240
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1


Due to the changes to CharTokenizer (LUCENE-2183), WhitespaceAnalyzer and 
SimpleAnalyzer need a Version ctor. The default ctors must be deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2240) SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors

2010-01-29 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2240:


Attachment: LUCENE-2240.patch

This patch adds the new Version ctors and deprecates the default ctor. I did 
not change any references, as I want to split those up into smaller issues. I 
had already changed all references once, which resulted in a 400k patch; we 
should rather do it step by step.
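The ctor-deprecation pattern described above can be sketched as follows. This is a hypothetical standalone illustration, not the actual patch: the `Version` enum and the nested `WhitespaceAnalyzer` are stand-ins for the real Lucene classes.

```java
// Stand-in for org.apache.lucene.util.Version, for illustration only.
enum Version { LUCENE_30, LUCENE_31 }

public class Main {
    static class WhitespaceAnalyzer {
        final Version matchVersion;

        /** New Version-based ctor: callers declare which behavior they want. */
        WhitespaceAnalyzer(Version matchVersion) {
            this.matchVersion = matchVersion;
        }

        /** @deprecated use {@link #WhitespaceAnalyzer(Version)} instead */
        @Deprecated
        WhitespaceAnalyzer() {
            this(Version.LUCENE_30);  // old ctor pinned to legacy behavior
        }
    }

    public static void main(String[] args) {
        System.out.println(new WhitespaceAnalyzer(Version.LUCENE_31).matchVersion);
    }
}
```

Existing callers keep compiling (with a deprecation warning) while new code opts into the version-dependent behavior explicitly.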

 SimpleAnalyzer and WhitespaceAnalyzer should have Version ctors
 ---

 Key: LUCENE-2240
 URL: https://issues.apache.org/jira/browse/LUCENE-2240
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2240.patch


 Due to the Changes to CharTokenizer ( LUCENE-2183 ) WhitespaceAnalyzer and 
 SimpleAnalyzer need a Version ctor. Default ctors must be deprecated

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2241) Core Tests should call Version based ctors instead of deprecated default ctors

2010-01-29 Thread Simon Willnauer (JIRA)
Core Tests should call Version based ctors instead of deprecated default ctors
--

 Key: LUCENE-2241
 URL: https://issues.apache.org/jira/browse/LUCENE-2241
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1


LUCENE-2183 introduced new ctors for all CharTokenizer subclasses. Core tests 
should use those ctors with Version.LUCENE_CURRENT instead of the 
deprecated ctors. Additionally, LUCENE-2240 introduces more Version ctors for 
WhitespaceAnalyzer and SimpleAnalyzer; tests should also use their Version ctors 
instead of the default ones.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2241) Core Tests should call Version based ctors instead of deprecated default ctors

2010-01-29 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2241:


Attachment: LUCENE-2241.patch

converted all core tests to use Version ctors

 Core Tests should call Version based ctors instead of deprecated default ctors
 --

 Key: LUCENE-2241
 URL: https://issues.apache.org/jira/browse/LUCENE-2241
 Project: Lucene - Java
  Issue Type: Task
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2241.patch


 LUCENE-2183 introduced new ctors for all CharTokenizer subclasses. Core 
 tests should use those ctors with Version.LUCENE_CURRENT instead of the 
 deprecated ctors. Additionally, LUCENE-2240 introduces more Version ctors for 
 WhitespaceAnalyzer and SimpleAnalyzer; tests should also use their Version 
 ctors instead of the default ones.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors

2010-01-29 Thread Simon Willnauer (JIRA)
Contrib CharTokenizer classes should be instantiated using their new Version 
based ctors


 Key: LUCENE-2242
 URL: https://issues.apache.org/jira/browse/LUCENE-2242
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1


Contrib CharTokenizer classes should be instantiated using their new Version 
based ctors introduced by LUCENE-2183 and LUCENE-2240

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2242) Contrib CharTokenizer classes should be instantiated using their new Version based ctors

2010-01-29 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2242:


Attachment: LUCENE-2242.patch

converted contrib/analyzers

 Contrib CharTokenizer classes should be instantiated using their new Version 
 based ctors
 

 Key: LUCENE-2242
 URL: https://issues.apache.org/jira/browse/LUCENE-2242
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 3.1
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2242.patch


 Contrib CharTokenizer classes should be instantiated using their new Version 
 based ctors introduced by LUCENE-2183 and LUCENE-2240

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-28 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2183:


Attachment: LUCENE-2183.patch

Added a CHANGES.txt entry and fixed two supplementary-character-related bugs in 
the new version of incrementToken(). Test cases added for the bugs.

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, 
 LUCENE-2183.patch, LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve 
 backwards compatibility. 
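The char-versus-codepoint distinction that motivates this issue can be shown in a few lines of plain Java (standalone sketch, not Lucene code). A supplementary character occupies one codepoint but two Java chars (a surrogate pair), so char-level tokenizers see it as two meaningless halves:

```java
public class Main {
    public static void main(String[] args) {
        // U+1D543 (MATHEMATICAL DOUBLE-STRUCK CAPITAL L) is supplementary:
        // one codepoint encoded as the surrogate pair \uD835\uDD43.
        String s = "a\uD835\uDD43b";
        System.out.println(s.length());                      // char units
        System.out.println(s.codePointCount(0, s.length())); // codepoints

        // Codepoint-level iteration, the style CharTokenizer should use:
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println(Character.isLetter(cp));
            i += Character.charCount(cp);  // advance by 1 or 2 chars
        }
    }
}
```

Iterating with `charAt` instead would classify each surrogate half separately and break letter/non-letter decisions for supplementary characters.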

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-28 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805909#action_12805909
 ] 

Simon Willnauer commented on LUCENE-2183:
-

I ran the following benchmark .alg file against the latest patch (specialized 
old and new methods), the patch with the proxy methods, and the old 3.0 code. 
The outcome shows that the specialized code is about 8% faster than the 
proxy-class-based code, so I would rather keep the specialized code, as this 
class is performance sensitive.

.alg file
{quote}
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
content.source.forever=false
{ Rounds { ReadTokens ReadTokens  : *  NewRound ResetSystemErase} : 10
RepAll
{quote}

10 Rounds with the latest patch
{quote}
 [java] Report All (11 out of 12)
 [java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 [java] Rounds_10               0       1           0   0.00       14.83   5,049,432   66,453,504
 [java] ReadTokens_Exhaust      0       1           0   0.00        2.07  34,558,000   55,705,600
 [java] ReadTokens_Exhaust      1       1           0   0.00        1.40  41,865,312   60,555,264
 [java] ReadTokens_Exhaust      2       1           0   0.00        1.22  34,393,904   63,176,704
 [java] ReadTokens_Exhaust      3       1           0   0.00        1.24  15,440,624   64,487,424
 [java] ReadTokens_Exhaust      4       1           0   0.00        1.22   7,540,512   65,601,536
 [java] ReadTokens_Exhaust      5       1           0   0.00        1.21  50,174,760   67,239,936
 [java] ReadTokens_Exhaust      6       1           0   0.00        1.19  22,202,768   67,174,400
 [java] ReadTokens_Exhaust      7       1           0   0.00        1.19  20,591,672   68,812,800
 [java] ReadTokens_Exhaust      8       1           0   0.00        1.18  63,749,984   69,009,408
 [java] ReadTokens_Exhaust      9       1           0   0.00        1.19  22,331,600   68,943,872
{quote}

10 rounds with Proxy Class
{quote}
 [java] Report All (11 out of 12)
 [java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 [java] Rounds_10               0       1           0   0.00       16.33   5,021,144   67,436,544
 [java] ReadTokens_Exhaust      0       1           0   0.00        2.34  44,649,496   59,244,544
 [java] ReadTokens_Exhaust      1       1           0   0.00        1.53  36,681,952   61,472,768
 [java] ReadTokens_Exhaust      2       1           0   0.00        1.37  13,863,688   64,094,208
 [java] ReadTokens_Exhaust      3       1           0   0.00        1.34  50,247,864   65,470,464
 [java] ReadTokens_Exhaust      4       1           0   0.00        1.36  14,922,888   66,322,432
 [java] ReadTokens_Exhaust      5       1           0   0.00        1.36   5,718,296   67,371,008
 [java] ReadTokens_Exhaust      6       1           0   0.00        1.32  54,583,776   67,502,080
 [java] ReadTokens_Exhaust      7       1           0   0.00        1.33  35,739,800   68,943,872
 [java] ReadTokens_Exhaust      8       1           0   0.00        1.32  24,985,688   69,861,376
 [java] ReadTokens_Exhaust      9       1           0   0.00        1.29  64,138,112   69,730,304
{quote}

10 rounds with current trunk
{quote}
 [java] Report All (11 out of 12)
 [java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 [java] Rounds_10               0       1           0   0.00       15.19   5,040,928   66,256,896
 [java] ReadTokens_Exhaust      0       1           0   0.00        2.15  39,548,440   55,443,456
 [java] ReadTokens_Exhaust      1       1           0   0.00        1.43  28,088,544   60,096,512
 [java] ReadTokens_Exhaust      2       1           0   0.00        1.27  16,004,088   61,800,448
 [java] ReadTokens_Exhaust      3       1           0   0.00        1.25  51,034,016   63,045,632
 [java] ReadTokens_Exhaust      4       1           0   0.00        1.24  23,371,056   63,504,384
 [java] ReadTokens_Exhaust      5       1           0   0.00        1.24  12,964,368   65,208,320
 [java] ReadTokens_Exhaust      6       1           0   0.00        1.25   6,598,128   65,601,536
 [java] ReadTokens_Exhaust      7       1           0   0.00

[jira] Issue Comment Edited: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-28 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805909#action_12805909
 ] 

Simon Willnauer edited comment on LUCENE-2183 at 1/28/10 1:16 PM:
--

I ran the following benchmark .alg file against the latest patch (specialized 
old and new methods), the patch with the proxy methods, and the old 3.0 code. 
The outcome shows that the specialized code is about 8% faster than the 
proxy-class-based code, so I would rather keep the specialized code, as this 
class is performance sensitive.

.alg file
{code}
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
content.source.forever=false
{ Rounds { ReadTokens ReadTokens  : *  NewRound ResetSystemErase} : 10
RepAll
{code}

10 Rounds with the latest patch
{code}
 [java] Report All (11 out of 12)
 [java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 [java] Rounds_10               0       1           0   0.00       14.83   5,049,432   66,453,504
 [java] ReadTokens_Exhaust      0       1           0   0.00        2.07  34,558,000   55,705,600
 [java] ReadTokens_Exhaust      1       1           0   0.00        1.40  41,865,312   60,555,264
 [java] ReadTokens_Exhaust      2       1           0   0.00        1.22  34,393,904   63,176,704
 [java] ReadTokens_Exhaust      3       1           0   0.00        1.24  15,440,624   64,487,424
 [java] ReadTokens_Exhaust      4       1           0   0.00        1.22   7,540,512   65,601,536
 [java] ReadTokens_Exhaust      5       1           0   0.00        1.21  50,174,760   67,239,936
 [java] ReadTokens_Exhaust      6       1           0   0.00        1.19  22,202,768   67,174,400
 [java] ReadTokens_Exhaust      7       1           0   0.00        1.19  20,591,672   68,812,800
 [java] ReadTokens_Exhaust      8       1           0   0.00        1.18  63,749,984   69,009,408
 [java] ReadTokens_Exhaust      9       1           0   0.00        1.19  22,331,600   68,943,872
{code}

10 rounds with Proxy Class
{code}
 [java] Report All (11 out of 12)
 [java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 [java] Rounds_10               0       1           0   0.00       16.33   5,021,144   67,436,544
 [java] ReadTokens_Exhaust      0       1           0   0.00        2.34  44,649,496   59,244,544
 [java] ReadTokens_Exhaust      1       1           0   0.00        1.53  36,681,952   61,472,768
 [java] ReadTokens_Exhaust      2       1           0   0.00        1.37  13,863,688   64,094,208
 [java] ReadTokens_Exhaust      3       1           0   0.00        1.34  50,247,864   65,470,464
 [java] ReadTokens_Exhaust      4       1           0   0.00        1.36  14,922,888   66,322,432
 [java] ReadTokens_Exhaust      5       1           0   0.00        1.36   5,718,296   67,371,008
 [java] ReadTokens_Exhaust      6       1           0   0.00        1.32  54,583,776   67,502,080
 [java] ReadTokens_Exhaust      7       1           0   0.00        1.33  35,739,800   68,943,872
 [java] ReadTokens_Exhaust      8       1           0   0.00        1.32  24,985,688   69,861,376
 [java] ReadTokens_Exhaust      9       1           0   0.00        1.29  64,138,112   69,730,304
{code}

10 rounds with current trunk
{code}
 [java]  Report All (11 out of 12)
 [java] Operation             round   runCnt   recsPerRun   rec/s   elapsedSec    avgUsedMem   avgTotalMem
 [java] Rounds_10                 0        1            0    0.00        15.19     5,040,928    66,256,896
 [java] ReadTokens_Exhaust        0        1            0    0.00         2.15    39,548,440    55,443,456
 [java] ReadTokens_Exhaust        1        1            0    0.00         1.43    28,088,544    60,096,512
 [java] ReadTokens_Exhaust        2        1            0    0.00         1.27    16,004,088    61,800,448
 [java] ReadTokens_Exhaust        3        1            0    0.00         1.25    51,034,016    63,045,632
 [java] ReadTokens_Exhaust        4        1            0    0.00         1.24    23,371,056    63,504,384
 [java] ReadTokens_Exhaust        5        1            0    0.00         1.24    12,964,368    65,208,320
 [java] ReadTokens_Exhaust        6        1            0    0.00         1.25     6,598,128    65,601,536
 [java] ReadTokens_Exhaust        7

[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-28 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806028#action_12806028
 ] 

Simon Willnauer commented on LUCENE-2183:
-

bq. For that a link using javadoc {...@link Character#supplementary} would be 
good. I will fix this here, as I already have the patch applied and will 
commit it later.

Uwe, I will take care of it and upload another patch. Thanks for being picky, Rob!

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, 
 LUCENE-2183.patch, LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2238) deprecate ChineseAnalyzer

2010-01-28 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2238:
---

Assignee: Simon Willnauer

 deprecate ChineseAnalyzer
 -

 Key: LUCENE-2238
 URL: https://issues.apache.org/jira/browse/LUCENE-2238
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2238.patch


 The ChineseAnalyzer, ChineseTokenizer, and ChineseFilter (not the smart one, 
 or CJK) index Chinese text as individual characters and remove English 
 stopwords, etc.
 In my opinion we should simply deprecate all of this in favor of 
 StandardAnalyzer, StandardTokenizer, and StopFilter, which do the same 
 thing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2238) deprecate ChineseAnalyzer

2010-01-28 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806269#action_12806269
 ] 

Simon Willnauer commented on LUCENE-2238:
-

+1 I will commit this later today if nobody objects

 deprecate ChineseAnalyzer
 -

 Key: LUCENE-2238
 URL: https://issues.apache.org/jira/browse/LUCENE-2238
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2238.patch


 The ChineseAnalyzer, ChineseTokenizer, and ChineseFilter (not the smart one, 
 or CJK) index Chinese text as individual characters and remove English 
 stopwords, etc.
 In my opinion we should simply deprecate all of this in favor of 
 StandardAnalyzer, StandardTokenizer, and StopFilter, which do the same 
 thing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-27 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805508#action_12805508
 ] 

Simon Willnauer commented on LUCENE-2183:
-

Short update: I found a bug in the latest version, which was untested. I will 
update soon with a speed comparison between the current version and the version 
using the proxy class.

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, 
 LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1845) if the build fails to download JARs for contrib/db, just skip its tests

2010-01-17 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1845:


Attachment: LUCENE-1845.patch

I haven't looked at this issue for a while, but I noticed today that the 
version we are using is no longer available for download on the Oracle pages. 
If you follow the link in the build file you will be able to download the zip 
file, but I guess we should upgrade to the latest 3.3 version of BDB-JE.
(see 
http://www.oracle.com/technology/software/products/berkeley-db/je/index.html - 
version 3.3.69)
There is also another mirror that serves the jar directly (a maven repository) 
that might be more reliable.
I updated the patch to load the 3.3.93 version of the jar directly and skip the 
unzip step, as we now download only the jar file. I also updated the maven pom 
template files to reference the right version of BDB-JE, which wasn't the case 
before.

I think we should give the maven-repo mirror a chance though.



 if the build fails to download JARs for contrib/db, just skip its tests
 ---

 Key: LUCENE-1845
 URL: https://issues.apache.org/jira/browse/LUCENE-1845
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1845.patch, LUCENE-1845.patch, LUCENE-1845.txt, 
 LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt


 Every so often our nightly build fails because contrib/db is unable to 
 download the necessary BDB JARs from http://downloads.osafoundation.org.  I 
 think in such cases we should simply skip contrib/db's tests, if it's the 
 nightly build that's running, since it's a false positive failure.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1845) if the build fails to download JARs for contrib/db, just skip its tests

2010-01-17 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801404#action_12801404
 ] 

Simon Willnauer commented on LUCENE-1845:
-

Mike, can you take this issue? It unfortunately touches core stuff :/

simon

 if the build fails to download JARs for contrib/db, just skip its tests
 ---

 Key: LUCENE-1845
 URL: https://issues.apache.org/jira/browse/LUCENE-1845
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1845.patch, LUCENE-1845.patch, LUCENE-1845.txt, 
 LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt


 Every so often our nightly build fails because contrib/db is unable to 
 download the necessary BDB JARs from http://downloads.osafoundation.org.  I 
 think in such cases we should simply skip contrib/db's tests, if it's the 
 nightly build that's running, since it's a false positive failure.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2220) Stackoverflow when calling deprecated CharArraySet.copy()

2010-01-17 Thread Simon Willnauer (JIRA)
Stackoverflow when calling deprecated CharArraySet.copy()
-

 Key: LUCENE-2220
 URL: https://issues.apache.org/jira/browse/LUCENE-2220
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1


Calling CharArraySet#copy(set) without the version argument (deprecated) with 
an instance of CharArraySet results in a stack overflow, as this method checks 
if the given set is a CharArraySet and then calls itself again. This was 
accidentally introduced by an overloaded alternative method during 
LUCENE-2169 which was not used in the final patch.
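
The bug shape described above can be sketched like this (simplified, hypothetical names standing in for CharArraySet and its overloads; not the actual Lucene source):

```java
import java.util.HashSet;
import java.util.Set;

// Illustration of the overload-dispatch bug: the deprecated copy() detects
// the "special" set type and dispatches back to itself instead of delegating
// to the real implementation, so it recurses until the stack blows.
class CopyDemo {
    // Broken shape: unconditional self-call for HashSet arguments.
    static Set<String> brokenCopy(Set<String> set) {
        if (set instanceof HashSet) {
            return brokenCopy(set); // calls itself again -> StackOverflowError
        }
        return new HashSet<>(set);
    }

    // Fixed shape: always delegate to the actual copy logic.
    static Set<String> fixedCopy(Set<String> set) {
        return new HashSet<>(set);
    }
}
```

The fix in the patch is the second shape: the deprecated overload forwards to the version-aware method instead of re-entering itself.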

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2220) Stackoverflow when calling deprecated CharArraySet.copy()

2010-01-17 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2220:


Attachment: LUCENE-2220.patch

Here is a patch and the extended test case.

 Stackoverflow when calling deprecated CharArraySet.copy()
 -

 Key: LUCENE-2220
 URL: https://issues.apache.org/jira/browse/LUCENE-2220
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2220.patch


 Calling CharArraySet#copy(set) without the version argument (deprecated) with 
 an instance of CharArraySet results in a stack overflow, as this method checks 
 if the given set is a CharArraySet and then calls itself again. This was 
 accidentally introduced by an overloaded alternative method during 
 LUCENE-2169 which was not used in the final patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1845) if the build fails to download JARs for contrib/db, just skip its tests

2010-01-17 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801416#action_12801416
 ] 

Simon Willnauer commented on LUCENE-1845:
-

Mike, thanks for resolving this. I already replied to the commit mail but will 
mention it here again for completeness:
we should add a changes.txt entry to notify users that we upgraded the version.

simon

 if the build fails to download JARs for contrib/db, just skip its tests
 ---

 Key: LUCENE-1845
 URL: https://issues.apache.org/jira/browse/LUCENE-1845
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1845.patch, LUCENE-1845.patch, LUCENE-1845.txt, 
 LUCENE-1845.txt, LUCENE-1845.txt, LUCENE-1845.txt


 Every so often our nightly build fails because contrib/db is unable to 
 download the necessary BDB JARs from http://downloads.osafoundation.org.  I 
 think in such cases we should simply skip contrib/db's tests, if it's the 
 nightly build that's running, since it's a false positive failure.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2198) support protected words in Stemming TokenFilters

2010-01-17 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2198:


Attachment: LUCENE-2198.patch

This patch ports all stemmers in core and contrib/analyzers to make use of the 
KeywordAttribute. 
I did not include snowball yet.

 support protected words in Stemming TokenFilters
 

 Key: LUCENE-2198
 URL: https://issues.apache.org/jira/browse/LUCENE-2198
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 3.0
Reporter: Robert Muir
Priority: Minor
 Attachments: LUCENE-2198.patch, LUCENE-2198.patch


 This is from LUCENE-1515
 I propose that all stemming TokenFilters have an 'exclusion set' that 
 bypasses any stemming for words in this set.
 Some stemming tokenfilters have this, some do not.
 This would be one way for Karl to implement his new Swedish stemmer (as a 
 text file of ignore words).
 Additionally, it would remove duplication between lucene and solr, as they 
 reimplement snowballfilter since it does not have this functionality.
 Finally, I think this is a pretty common use case, where people want to 
 ignore things like proper nouns in the stemming.
 As an alternative design I considered a case where we generalized this to 
 CharArrayMap (and ignoring words would mean mapping them to themselves), 
 which would also provide a mechanism to override the stemming algorithm. But 
 I think this is too expert, could be its own filter, and the only example of 
 this i can find is in the Dutch stemmer.
 So I think we should just provide ignore with CharArraySet, but if you feel 
 otherwise please comment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters

2010-01-17 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801446#action_12801446
 ] 

Simon Willnauer commented on LUCENE-2198:
-

I kind of agree with both of you. When I started implementing this attribute I 
had FlagsAttribute in mind, but I didn't choose it because users can pick an 
arbitrary bit of the word, which might lead to unexpected behavior. 

Another solution I had in mind is to introduce another attribute (or extend 
FlagsAttribute) holding a Lucene-private (not the Java visibility keyword) enum 
that can be extended in the future. Internally this could use a word or a 
bitset (a single word will do, I guess) where bits are set according to the 
enum ordinal. That way we could encode much more than one single boolean, and 
the cost of adding new flags / enum values would be minimal.

{code}
booleanAttribute.isSet(BooleanAttributeEnum.Keyword)
{code}

something like that, thoughts?
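
The idea above could be sketched roughly as follows (all names here are invented for illustration; this is not a proposed final API):

```java
// A flags attribute backed by a single long word, indexed by a
// Lucene-controlled enum: each enum constant owns one bit, so adding a new
// flag is just adding an enum value, at no extra storage cost (up to 64).
class BooleanAttributeDemo {
    enum Flag { KEYWORD, COMPOUND, STOPWORD } // extendable in future releases

    private long bits; // one bit per Flag.ordinal()

    void set(Flag flag)      { bits |= 1L << flag.ordinal(); }
    void clear(Flag flag)    { bits &= ~(1L << flag.ordinal()); }
    boolean isSet(Flag flag) { return (bits & (1L << flag.ordinal())) != 0; }
}
```

Unlike a raw FlagsAttribute bitmask, users cannot collide on arbitrary bits: the enum is the only way to address them.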

 support protected words in Stemming TokenFilters
 

 Key: LUCENE-2198
 URL: https://issues.apache.org/jira/browse/LUCENE-2198
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 3.0
Reporter: Robert Muir
Priority: Minor
 Attachments: LUCENE-2198.patch, LUCENE-2198.patch


 This is from LUCENE-1515
 I propose that all stemming TokenFilters have an 'exclusion set' that 
 bypasses any stemming for words in this set.
 Some stemming tokenfilters have this, some do not.
 This would be one way for Karl to implement his new Swedish stemmer (as a 
 text file of ignore words).
 Additionally, it would remove duplication between lucene and solr, as they 
 reimplement snowballfilter since it does not have this functionality.
 Finally, I think this is a pretty common use case, where people want to 
 ignore things like proper nouns in the stemming.
 As an alternative design I considered a case where we generalized this to 
 CharArrayMap (and ignoring words would mean mapping them to themselves), 
 which would also provide a mechanism to override the stemming algorithm. But 
 I think this is too expert, could be its own filter, and the only example of 
 this i can find is in the Dutch stemmer.
 So I think we should just provide ignore with CharArraySet, but if you feel 
 otherwise please comment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2206) integrate snowball stopword lists

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801163#action_12801163
 ] 

Simon Willnauer commented on LUCENE-2206:
-

Robert, the patch looks good except for one thing: 
{code}
  public static HashSet<String> getSnowballWordSet(Reader reader)
{code}

it returns a HashSet but should really return a Set<String>. We plan to change 
all return types to the interface instead of the implementation.


 integrate snowball stopword lists
 -

 Key: LUCENE-2206
 URL: https://issues.apache.org/jira/browse/LUCENE-2206
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2206.patch


 The snowball project creates stopword lists as well as stemmers, example: 
 http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?view=markup
 This patch includes the following:
 * snowball stopword lists for 13 languages in contrib/snowball/resources
 * all stoplists are unmodified; only added a license header and converted each 
 one from whatever encoding it was in to UTF-8
 * added getSnowballWordSet to WordListLoader, because the format of 
 these files is very different: for example, it supports multiple words per 
 line and embedded comments.
 I did not add any changes to SnowballAnalyzer to actually automatically use 
 these lists yet, i would like us to discuss this in a future issue proposing 
 integrating snowball with contrib/analyzers.
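
As an illustration of the format described above (multiple words per line, embedded comments), a reader could look roughly like this. The `|` comment delimiter and the class/method names are assumptions for the sketch; this is not WordlistLoader's actual code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

// Sketch of parsing a snowball-style stoplist: strip everything after the
// comment delimiter, then split the remainder on whitespace, so one line
// may contribute several words or none at all.
class SnowballStopwords {
    static Set<String> load(Reader reader) throws IOException {
        Set<String> words = new HashSet<>();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment); // drop embedded comment
            }
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```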

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801181#action_12801181
 ] 

Simon Willnauer commented on LUCENE-2212:
-

Nice, Robert. I was adding a test class for PorterStemFilter during LUCENE-2198 
to test the KeywordAttr. Yet this looks very good.
I wonder if we should use getResourceAsStream rather than the system property; 
the resources should always be on the classpath.



 add a test for PorterStemFilter
 ---

 Key: LUCENE-2212
 URL: https://issues.apache.org/jira/browse/LUCENE-2212
 Project: Lucene - Java
  Issue Type: Test
  Components: Analysis
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2212.patch, porterTestData.zip


 There are no tests for PorterStemFilter, yet svn history reveals some (very 
 minor) cleanups, etc.
 The only thing executing its code in tests is a test or two in SmartChinese 
 tests.
 This patch runs the StemFilter against Martin Porter's test data set for this 
 stemmer, checking for expected output.
 The zip file is 100KB added to src/test, if this is too large I can change it 
 to download the data instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801243#action_12801243
 ] 

Simon Willnauer commented on LUCENE-2212:
-

bq. updated patch with getResource() + ZipFile 

:) thanks

bq. will commit this test at the end of the day unless anyone objects.
+1 go ahead

 add a test for PorterStemFilter
 ---

 Key: LUCENE-2212
 URL: https://issues.apache.org/jira/browse/LUCENE-2212
 Project: Lucene - Java
  Issue Type: Test
  Components: Analysis
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2212.patch, LUCENE-2212.patch, porterTestData.zip


 There are no tests for PorterStemFilter, yet svn history reveals some (very 
 minor) cleanups, etc.
 The only thing executing its code in tests is a test or two in SmartChinese 
 tests.
 This patch runs the StemFilter against Martin Porter's test data set for this 
 stemmer, checking for expected output.
 The zip file is 100KB added to src/test, if this is too large I can change it 
 to download the data instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801252#action_12801252
 ] 

Simon Willnauer commented on LUCENE-2195:
-

bq. I do not think unmodifiableset should have a no-arg ctor, so instead i 
pushed this up to emptychararrayset
ok I'm fine with that.

{quote}
i do not think emptychararrayset need override and throw uoe for removeAll or 
retainAll, and i don't think the tests were correct in assuming it will throw 
uoe. it will not throw uoe for say, removeAll only because it is empty. it will 
just do nothing.
{quote}

You are right: this should only throw the exception if the set contains the 
element and the iterator does not implement remove()
{code}
 * Note that this implementation throws an
 * <tt>UnsupportedOperationException</tt> if the iterator returned by this
 * collection's iterator method does not implement the <tt>remove</tt>
 * method and this collection contains the specified object.
{code}

The same is true for AbstractSet#removeAll() and retainAll().

Thanks for updating it. I think this is good to go though! 



 Speedup CharArraySet if set is empty
 

 Key: LUCENE-2195
 URL: https://issues.apache.org/jira/browse/LUCENE-2195
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch, 
 LUCENE-2195.patch


 CharArraySet#contains(...) always computes a hash code of the String, char[], or 
 CharSequence even if the set is empty. 
 contains should return false if the set is empty
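
The shortcut being proposed amounts to the following sketch (CharArraySet's real contains() hashes the char range directly without allocating a String; this simplified version just illustrates the early exit):

```java
import java.util.Set;

// Bail out before doing any hashing when the set is empty: an empty set
// can never contain the key, so computing its hash code is wasted work
// on every token.
class EmptySetDemo {
    static boolean contains(Set<String> set, char[] text, int off, int len) {
        if (set.isEmpty()) {
            return false; // no hash code, no lookup
        }
        return set.contains(new String(text, off, len));
    }
}
```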

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-15 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2183:


Attachment: LUCENE-2183.patch

I updated the patch to make use of the nice reflection utils and ported all 
subclasses of CharTokenizer to the int-based API.
Due to the addition of Version to the CharTokenizer ctors, this patch creates a 
lot of usage of deprecated API.
Yet, I haven't changed all the usage of the deprecated ctors; this should be 
done in another issue IMO.
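
The char vs. code point distinction driving the int-based API can be seen with any supplementary character; this snippet uses MATHEMATICAL FRAKTUR CAPITAL A (U+1D504) as an example:

```java
// Iterate a String by code point rather than by char: a supplementary
// character occupies two UTF-16 chars (a surrogate pair), so a char-based
// loop sees half a character, while codePointAt() returns the whole thing.
class CodePointDemo {
    static String letters(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // full code point, not half a pair
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

A char-based isLetter check would reject each surrogate half individually and drop the character; the int-based loop keeps it intact.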

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2183.patch, LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2214) Remove deprecated StemExclusionSet setters in contrib/analyzers

2010-01-15 Thread Simon Willnauer (JIRA)
Remove deprecated StemExclusionSet setters in contrib/analyzers
---

 Key: LUCENE-2214
 URL: https://issues.apache.org/jira/browse/LUCENE-2214
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1


Lots of stem exclusion sets have been deprecated in 3.0. As we are in contrib 
land we could now remove them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-15 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2183:


Attachment: LUCENE-2183.patch

Uwe, using an interface doesn't work as I cannot reduce the public 
visibility in CharTokenizer. Yet, this patch tries to solve it with an abstract 
class.
To be honest, I would rather say we duplicate the code and use a simple boolean 
switch in incrementToken. Not that nice, but definitely faster.

what do you think?

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800405#action_12800405
 ] 

Simon Willnauer commented on LUCENE-2195:
-

Any comments on the latest patch?

 Speedup CharArraySet if set is empty
 

 Key: LUCENE-2195
 URL: https://issues.apache.org/jira/browse/LUCENE-2195
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch


 CharArraySet#contains(...) always computes a hash code of the String, char[], or 
 CharSequence even if the set is empty. 
 contains should return false if the set is empty

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2198) support protected words in Stemming TokenFilters

2010-01-13 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2198:


Attachment: LUCENE-2198.patch

This patch contains an initial design proposal. I tried to name the new 
attribute a little more generically, as this could easily be used outside of the 
stemming domain.

all tests pass -- comments welcome.

 support protected words in Stemming TokenFilters
 

 Key: LUCENE-2198
 URL: https://issues.apache.org/jira/browse/LUCENE-2198
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 3.0
Reporter: Robert Muir
Priority: Minor
 Attachments: LUCENE-2198.patch


 This is from LUCENE-1515
 I propose that all stemming TokenFilters have an 'exclusion set' that 
 bypasses any stemming for words in this set.
 Some stemming tokenfilters have this, some do not.
 This would be one way for Karl to implement his new Swedish stemmer (as a 
 text file of ignore words).
 Additionally, it would remove duplication between lucene and solr, as they 
 reimplement snowballfilter since it does not have this functionality.
 Finally, I think this is a pretty common use case, where people want to 
 ignore things like proper nouns in the stemming.
 As an alternative design I considered a case where we generalized this to 
 CharArrayMap (and ignoring words would mean mapping them to themselves), 
 which would also provide a mechanism to override the stemming algorithm. But 
 I think this is too expert, could be its own filter, and the only example of 
 this i can find is in the Dutch stemmer.
 So I think we should just provide ignore with CharArraySet, but if you feel 
 otherwise please comment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2203) improved snowball testing

2010-01-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799868#action_12799868
 ] 

Simon Willnauer commented on LUCENE-2203:
-

Looks good to me. I haven't applied it, but it looks good! +1 from my side

 improved snowball testing
 -

 Key: LUCENE-2203
 URL: https://issues.apache.org/jira/browse/LUCENE-2203
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Attachments: LUCENE-2203.patch


 Snowball project has test vocabulary files for each language in their svn 
 repository, along with expected output.
 We should use these tests to ensure all languages are working correctly, and 
 it might be helpful in the future for identifying back-compat breaks/changes 
 if we ever want to upgrade Snowball, etc.




[jira] Commented: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods

2010-01-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799387#action_12799387
 ] 

Simon Willnauer commented on LUCENE-2188:
-

Good stuff, Uwe. I will fix LUCENE-2183 now.

 A handy utility class for tracking deprecated overridden methods
 

 Key: LUCENE-2188
 URL: https://issues.apache.org/jira/browse/LUCENE-2188
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, 
 LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch


 This issue provides a new handy utility class that keeps track of overridden 
 deprecated methods in non-final sub classes. This class can be used in new 
 deprecations.
 See the javadocs for an example.




[jira] Commented: (LUCENE-2203) improved snowball testing

2010-01-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798795#action_12798795
 ] 

Simon Willnauer commented on LUCENE-2203:
-

Robert, those tests seem to be very extensive - that's good!
But I honestly think we should make those tests optional in some way. The files 
you are downloading are very large and might be an issue for some folks. The 
file size is over 70MB, which is a lot for a test. I need to think about this a 
little and come up with some suggestions.

 improved snowball testing
 -

 Key: LUCENE-2203
 URL: https://issues.apache.org/jira/browse/LUCENE-2203
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
 Attachments: LUCENE-2203.patch


 Snowball project has test vocabulary files for each language in their svn 
 repository, along with expected output.
 We should use these tests to ensure all languages are working correctly, and 
 it might be helpful in the future for identifying back-compat breaks/changes 
 if we ever want to upgrade Snowball, etc.




[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798474#action_12798474
 ] 

Simon Willnauer commented on LUCENE-2199:
-

I plan to commit this today or tomorrow. Somebody volunteering to backport?

simon

 ShingleFilter skips over trie-shingles if outputUnigram is set to false
 ---

 Key: LUCENE-2199
 URL: https://issues.apache.org/jira/browse/LUCENE-2199
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2199.patch, LUCENE-2199.patch


 Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
 {quote}
 I noticed that if I set outputUnigrams to false it gives me the same output 
 for
 maxShingleSize=2 and maxShingleSize=3.
 please divide divide this this sentence
 when i set maxShingleSize to 4 output is:
 please divide please divide this sentence divide this this sentence
 I was expecting the output as follows with maxShingleSize=3 and
 outputUnigrams=false :
 please divide this divide this sentence 
 {quote}




[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798516#action_12798516
 ] 

Simon Willnauer commented on LUCENE-2200:
-

Robert, when you commit this, please make sure you mark the attributes in 
EdgeNGramTokenFilter.java final - thanks.
Steve, thanks for the patch; such work is always appreciated.

simon

 Several final classes have non-overriding protected members
 ---

 Key: LUCENE-2200
 URL: https://issues.apache.org/jira/browse/LUCENE-2200
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: Steven Rowe
Assignee: Robert Muir
Priority: Trivial
 Attachments: LUCENE-2200.patch, LUCENE-2200.patch


 Protected member access in final classes, except where a protected method 
 overrides a superclass's protected method, makes little sense.  The attached 
 patch converts final classes' protected access on fields to private, removes 
 two final classes' unused protected constructors, and converts one final 
 class's protected final method to private.




[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798517#action_12798517
 ] 

Simon Willnauer commented on LUCENE-2197:
-

Yonik, would you please commit this issue? I think we agreed on your solution.
simon

 StopFilter should not create a new CharArraySet if the given set is already 
 an instance of CharArraySet
 ---

 Key: LUCENE-2197
 URL: https://issues.apache.org/jira/browse/LUCENE-2197
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Critical
 Fix For: 3.1

 Attachments: LUCENE-2197.patch, LUCENE-2197.patch


 With LUCENE-2094 a new CharArraySet is created no matter what type of set is 
 passed to StopFilter. This does not behave as documented and could introduce 
 serious performance problems. Instead, according to the javadoc, the instance 
 should be passed to CharArraySet.copy (which is very fast for CharArraySet 
 instances) rather than copied via new CharArraySet()




[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798522#action_12798522
 ] 

Simon Willnauer commented on LUCENE-2199:
-

I committed this in revision 897672.
Robert, would you please backport this to 2.9 / 3.0? Thanks for the offer!

simon

 ShingleFilter skips over trie-shingles if outputUnigram is set to false
 ---

 Key: LUCENE-2199
 URL: https://issues.apache.org/jira/browse/LUCENE-2199
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2199.patch, LUCENE-2199.patch


 Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
 {quote}
 I noticed that if I set outputUnigrams to false it gives me the same output 
 for
 maxShingleSize=2 and maxShingleSize=3.
 please divide divide this this sentence
 when i set maxShingleSize to 4 output is:
 please divide please divide this sentence divide this this sentence
 I was expecting the output as follows with maxShingleSize=3 and
 outputUnigrams=false :
 please divide this divide this sentence 
 {quote}




[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters

2010-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798388#action_12798388
 ] 

Simon Willnauer commented on LUCENE-2198:
-

bq. So I think we should just provide ignore with CharArraySet, but if you feel 
otherwise please comment.
While I read your proposal, a possibly more flexible design came to my mind. We 
could introduce a StemAttribute that has a method public boolean stem(), used by 
every stemmer to decide whether a token should be stemmed. That way we decouple 
the decision whether a token should be stemmed from the stemming algorithm. This 
also enables custom filters to set the value based on criteria other than a term 
being in a set. 
The default value is of course true, but it can be set on any condition. Inside 
an analyzer we can add a filter right before the stemmer based on a 
CharArraySet. Yet if the set is empty or null, we simply leave the filter out. 
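A minimal sketch of that idea (all names here are hypothetical stand-ins, not the actual Lucene API): a marker filter sets a per-token flag from a protected-words set, and the stemmer consults the flag instead of owning the exclusion logic itself.

```java
import java.util.Set;

// Hypothetical per-token state, playing the role of the proposed StemAttribute.
final class StemState {
    boolean stem = true;   // default: stem every token
}

final class Sketch {
    // A filter placed before the stemmer: flips the flag for protected words.
    static void markProtected(String term, Set<String> protectedWords, StemState state) {
        state.stem = !protectedWords.contains(term);
    }

    // The stemmer checks the flag instead of carrying its own exclusion set.
    static String maybeStem(String term, StemState state) {
        return state.stem ? term.replaceAll("s$", "") : term;  // toy "stemmer"
    }
}
```

Because the flag is just token state, a custom filter could set it from any criterion (part-of-speech, capitalization, ...), not only set membership.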



 support protected words in Stemming TokenFilters
 

 Key: LUCENE-2198
 URL: https://issues.apache.org/jira/browse/LUCENE-2198
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 3.0
Reporter: Robert Muir
Priority: Minor

 This is from LUCENE-1515
 I propose that all stemming TokenFilters have an 'exclusion set' that 
 bypasses any stemming for words in this set.
 Some stemming tokenfilters have this, some do not.
 This would be one way for Karl to implement his new swedish stemmer (as a 
 text file of ignore words).
 Additionally, it would remove duplication between lucene and solr, as they 
 reimplement snowballfilter since it does not have this functionality.
 Finally, I think this is a pretty common use case, where people want to 
 ignore things like proper nouns in the stemming.
 As an alternative design I considered a case where we generalized this to 
 CharArrayMap (and ignoring words would mean mapping them to themselves), 
 which would also provide a mechanism to override the stemming algorithm. But 
 I think this is too expert, could be its own filter, and the only example of 
 this i can find is in the Dutch stemmer.
 So I think we should just provide ignore with CharArraySet, but if you feel 
 otherwise please comment.




[jira] Commented: (LUCENE-2200) Several final classes have non-overriding protected members

2010-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798389#action_12798389
 ] 

Simon Willnauer commented on LUCENE-2200:
-

Steve, I briefly looked at your patch. Could we make some of the member vars 
final too? The reader in CharReader or the defaultAnalyzer in 
ShingleAnalyzerWrapper, for instance.

simon

 Several final classes have non-overriding protected members
 ---

 Key: LUCENE-2200
 URL: https://issues.apache.org/jira/browse/LUCENE-2200
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: Steven Rowe
Priority: Trivial
 Attachments: LUCENE-2200.patch


 Protected member access in final classes, except where a protected method 
 overrides a superclass's protected method, makes little sense.  The attached 
 patch converts final classes' protected access on fields to private, removes 
 two final classes' unused protected constructors, and converts one final 
 class's protected final method to private.




[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797952#action_12797952
 ] 

Simon Willnauer commented on LUCENE-2197:
-

bq. Here's a patch that reverts to the previous behavior of using the set 
provided. 
Doesn't seem to lead anywhere to discuss this with the performance police, 
judging by the average size of your comments. :)
This was actually meant to be a pattern for analyzer subclasses, so I won't be 
the immutability police here. Yonik, will you take this issue and commit?

bq. We should really avoid this type of nannyism in Lucene.
Oh well, this seems to me like a void* is / isn't evil discussion - never mind.

 StopFilter should not create a new CharArraySet if the given set is already 
 an instance of CharArraySet
 ---

 Key: LUCENE-2197
 URL: https://issues.apache.org/jira/browse/LUCENE-2197
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Critical
 Fix For: 3.1

 Attachments: LUCENE-2197.patch, LUCENE-2197.patch


 With LUCENE-2094 a new CharArraySet is created no matter what type of set is 
 passed to StopFilter. This does not behave as documented and could introduce 
 serious performance problems. Instead, according to the javadoc, the instance 
 should be passed to CharArraySet.copy (which is very fast for CharArraySet 
 instances) rather than copied via new CharArraySet()




[jira] Closed: (LUCENE-1967) make it easier to access default stopwords for language analyzers

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer closed LUCENE-1967.
---

Resolution: Fixed

incorporated in LUCENE-2034

 make it easier to access default stopwords for language analyzers
 -

 Key: LUCENE-1967
 URL: https://issues.apache.org/jira/browse/LUCENE-1967
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Simon Willnauer
Priority: Minor

 DM Smith made the following comment: (sometimes it is hard to dig out the 
 stop set from the analyzers)
 Looking around, some of these analyzers have very different ways of storing 
 the default list.
 One idea is to consider generalizing something like what Simon did with 
 LUCENE-1965, LUCENE-1962,
 and having all stopwords lists stored as .txt files in resources folder.
 {code}
   /**
* Returns an unmodifiable instance of the default stop-words set.
* @return an unmodifiable instance of the default stop-words set.
*/
   public static SetString getDefaultStopSet()
 {code}




[jira] Created: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)
ShingleFilter skips over trie-shingles if outputUnigram is set to false
---

 Key: LUCENE-2199
 URL: https://issues.apache.org/jira/browse/LUCENE-2199
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.0, 2.9.1, 2.9, 2.4.1, 2.4
Reporter: Simon Willnauer
 Fix For: 3.1


Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa

{quote}
I noticed that if I set outputUnigrams to false it gives me the same output for
maxShingleSize=2 and maxShingleSize=3.

please divide divide this this sentence

when i set maxShingleSize to 4 output is:

please divide please divide this sentence divide this this sentence

I was expecting the output as follows with maxShingleSize=3 and
outputUnigrams=false :

please divide this divide this sentence 
{quote}
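For reference, word n-gram ("shingle") generation for the quoted example can be sketched as follows. This is a toy generator illustrating the expected combinatorics, not ShingleFilter's actual implementation; with outputUnigrams=false the minimum shingle size is 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShingleSketch {
    // Emits, in token order, every shingle of size minSize..maxSize.
    static List<String> shingles(String[] tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int n = minSize; n <= maxSize && i + n <= tokens.length; n++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }
}
```

For the tokens {please, divide, this, sentence} with minSize=2 and maxSize=3, both the bigrams and the trigrams ("please divide this", "divide this sentence") appear at each position, which is the behavior the bug report says was missing.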







[jira] Updated: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2199:


Attachment: LUCENE-2199.patch

This patch adds tests for trigrams and fourgrams, with and without 
outputUnigrams. All tests pass.

 ShingleFilter skips over trie-shingles if outputUnigram is set to false
 ---

 Key: LUCENE-2199
 URL: https://issues.apache.org/jira/browse/LUCENE-2199
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2199.patch


 Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
 {quote}
 I noticed that if I set outputUnigrams to false it gives me the same output 
 for
 maxShingleSize=2 and maxShingleSize=3.
 please divide divide this this sentence
 when i set maxShingleSize to 4 output is:
 please divide please divide this sentence divide this this sentence
 I was expecting the output as follows with maxShingleSize=3 and
 outputUnigrams=false :
 please divide this divide this sentence 
 {quote}




[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798162#action_12798162
 ] 

Simon Willnauer commented on LUCENE-2199:
-

We should likely backport this to 2.9 / 3.0 too.

 ShingleFilter skips over trie-shingles if outputUnigram is set to false
 ---

 Key: LUCENE-2199
 URL: https://issues.apache.org/jira/browse/LUCENE-2199
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2199.patch


 Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
 {quote}
 I noticed that if I set outputUnigrams to false it gives me the same output 
 for
 maxShingleSize=2 and maxShingleSize=3.
 please divide divide this this sentence
 when i set maxShingleSize to 4 output is:
 please divide please divide this sentence divide this this sentence
 I was expecting the output as follows with maxShingleSize=3 and
 outputUnigrams=false :
 please divide this divide this sentence 
 {quote}




[jira] Assigned: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2199:
---

Assignee: Simon Willnauer

 ShingleFilter skips over trie-shingles if outputUnigram is set to false
 ---

 Key: LUCENE-2199
 URL: https://issues.apache.org/jira/browse/LUCENE-2199
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2199.patch, LUCENE-2199.patch


 Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
 {quote}
 I noticed that if I set outputUnigrams to false it gives me the same output 
 for
 maxShingleSize=2 and maxShingleSize=3.
 please divide divide this this sentence
 when i set maxShingleSize to 4 output is:
 please divide please divide this sentence divide this this sentence
 I was expecting the output as follows with maxShingleSize=3 and
 outputUnigrams=false :
 please divide this divide this sentence 
 {quote}




[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798189#action_12798189
 ] 

Simon Willnauer commented on LUCENE-2197:
-

bq. Sorry Simon... I think I just got fed up with stuff like this in the JDK 
over the years (that forces people to write their own implementations for best 
performance), and you happened to be the closest person at the time :)
No worries, thanks for the reply!

bq. To the software pedant, that's not safe and would probably be called bad 
design - ...
I understand and I can totally see your point; I was just a bit put off by the 
rather short rants (don't get me wrong). I agree with you that we should not do 
that in a filter, as this constructor could be called very frequently, 
especially if an analyzer does not implement reusableTokenStream. I would still 
argue that an analyzer is a different story, and I would want to keep the code 
in analyzers copying the set. Classes instantiated as frequently as filters 
should not introduce possible bottlenecks, while analyzers are usually shared, 
so there it won't be much of a hassle - any performance police issues with 
this? :)
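The tradeoff being argued here can be sketched as follows (hypothetical names, not the actual StopFilter code): reuse the given set when it is already the optimized type, and reserve the defensive copy for long-lived, shared objects such as analyzers.

```java
import java.util.HashSet;
import java.util.Set;

final class StopSetSketch {
    // Stand-in for CharArraySet (the real class hashes char[] contents directly).
    static final class FastSet extends HashSet<String> {
        FastSet() { }
        FastSet(Set<String> source) { super(source); }
    }

    // Filter path: filters may be constructed per token stream, so reuse the
    // given set when it is already the optimized type instead of copying it.
    static Set<String> forFilter(Set<String> stopWords) {
        return (stopWords instanceof FastSet) ? stopWords : new FastSet(stopWords);
    }

    // Analyzer path: analyzers are shared and long-lived, so a one-time
    // defensive copy at construction is an acceptable cost.
    static Set<String> forAnalyzer(Set<String> stopWords) {
        return new FastSet(stopWords);
    }
}
```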

 StopFilter should not create a new CharArraySet if the given set is already 
 an instance of CharArraySet
 ---

 Key: LUCENE-2197
 URL: https://issues.apache.org/jira/browse/LUCENE-2197
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Critical
 Fix For: 3.1

 Attachments: LUCENE-2197.patch, LUCENE-2197.patch


 With LUCENE-2094 a new CharArraySet is created no matter what type of set is 
 passed to StopFilter. This does not behave as documented and could introduce 
 serious performance problems. Instead, according to the javadoc, the instance 
 should be passed to CharArraySet.copy (which is very fast for CharArraySet 
 instances) rather than copied via new CharArraySet()




[jira] Created: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-07 Thread Simon Willnauer (JIRA)
Speedup CharArraySet if set is empty


 Key: LUCENE-2195
 URL: https://issues.apache.org/jira/browse/LUCENE-2195
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1


CharArraySet#contains(...) always computes the hash code of the String, char[] 
or CharSequence argument, even if the set is empty. 
contains should return false immediately if the set is empty.
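A sketch of the proposed fast path, using a simplified stand-in rather than the actual CharArraySet code: bail out before any hashing when the set holds no entries.

```java
import java.util.HashSet;
import java.util.Set;

final class EmptyFastPathSketch {
    private final Set<String> backing = new HashSet<>();
    private int count;                        // mirrors CharArraySet's element count

    void add(String s) {
        if (backing.add(s)) count++;
    }

    // Proposed behavior: return false immediately for the empty set, skipping
    // the hash-code computation over the input characters entirely.
    boolean contains(char[] text, int off, int len) {
        if (count == 0) return false;         // fast path, no hashing
        return backing.contains(new String(text, off, len));
    }
}
```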




[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2195:


Attachment: LUCENE-2195.patch

Here is a patch.

 Speedup CharArraySet if set is empty
 

 Key: LUCENE-2195
 URL: https://issues.apache.org/jira/browse/LUCENE-2195
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2195.patch


 CharArraySet#contains(...) always computes the hash code of the String, char[] 
 or CharSequence argument, even if the set is empty. 
 contains should return false immediately if the set is empty.




[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2195:


Attachment: LUCENE-2195.patch

Updated patch. This patch does not do the count==0 check in contains(Object), 
as o.toString() could return null and the NPE would be silently swallowed if 
the set were empty. 
The null check and NPE are necessary to yield consistent behavior whether or 
not the set is empty.

 Speedup CharArraySet if set is empty
 

 Key: LUCENE-2195
 URL: https://issues.apache.org/jira/browse/LUCENE-2195
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2195.patch, LUCENE-2195.patch


 CharArraySet#contains(...) always computes the hash code of the String, char[] 
 or CharSequence argument, even if the set is empty. 
 contains should return false immediately if the set is empty.




[jira] Updated: (LUCENE-2196) Spellchecker should implement java.io.Closable

2010-01-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2196:


Attachment: LUCENE-2196.patch

 Spellchecker should implement java.io.Closable
 --

 Key: LUCENE-2196
 URL: https://issues.apache.org/jira/browse/LUCENE-2196
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2196.patch


 As most of the Lucene classes implement Closeable (e.g. IndexWriter), 
 SpellChecker should do so too. 




[jira] Created: (LUCENE-2196) Spellchecker should implement java.io.Closable

2010-01-07 Thread Simon Willnauer (JIRA)
Spellchecker should implement java.io.Closable
--

 Key: LUCENE-2196
 URL: https://issues.apache.org/jira/browse/LUCENE-2196
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: Simon Willnauer
 Fix For: 3.1
 Attachments: LUCENE-2196.patch

As most of the Lucene classes implement Closeable (e.g. IndexWriter), 
SpellChecker should do so too. 




[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2010-01-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797697#action_12797697
 ] 

Simon Willnauer commented on LUCENE-2108:
-

Created a separate issue for that purpose: LUCENE-2196

 SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
 SpellChecker internally
 -

 Key: LUCENE-2108
 URL: https://issues.apache.org/jira/browse/LUCENE-2108
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/spellchecker
Affects Versions: 3.0
Reporter: Eirik Bjorsnos
Assignee: Simon Willnauer
 Fix For: 3.0.1, 3.1

 Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
 LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch, 
 LUCENE-2108_Lucene_2_9_branch.patch, LUCENE-2108_test_java14.patch


 I can't find any way to close the IndexSearcher (and IndexReader) that
 is being used by SpellChecker internally.
 I've worked around this issue by keeping a single SpellChecker open
 for each index, but I'd really like to be able to close it and
 reopen it on demand without leaking file descriptors.
 Could we add a close() method to SpellChecker that will close the
 IndexSearcher and null the reference to it? And perhaps add some code
 that reopens the searcher if the reference to it is null? Or would
 that break thread safety of SpellChecker?
 The attached patch adds a close method but leaves it to the user to
 call setSpellIndex to reopen the searcher if desired.
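
The close-and-reopen pattern described above can be sketched in plain Java. These are stand-in classes, not the real Lucene SpellChecker/IndexSearcher API; the names are illustrative only:

```java
// Minimal sketch of the close()/reopen pattern from the patch discussion.
// StubSearcher and ClosableSpellChecker are stand-ins, not Lucene classes.
class StubSearcher {
    private boolean closed = false;

    void close() { closed = true; }

    boolean isClosed() { return closed; }
}

class ClosableSpellChecker {
    private StubSearcher searcher = new StubSearcher();

    // Release the internal searcher (and its file descriptors) and null
    // the reference; the caller must reopen before the next lookup.
    public void close() {
        if (searcher != null) {
            searcher.close();
            searcher = null;
        }
    }

    // Reopen on demand, mirroring the patch's setSpellIndex().
    public void setSpellIndex() {
        close();
        searcher = new StubSearcher();
    }

    public boolean isOpen() {
        return searcher != null;
    }
}
```

As the comment notes, leaving the reopen step to the caller keeps close() itself simple, at the cost of the user having to call setSpellIndex before reuse.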

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2196) Spellchecker should implement java.io.Closable

2010-01-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2196.
-

   Resolution: Fixed
Fix Version/s: 3.0.1

committed in revision 896934 

thanks uwe

 Spellchecker should implement java.io.Closable
 --

 Key: LUCENE-2196
 URL: https://issues.apache.org/jira/browse/LUCENE-2196
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: Simon Willnauer
 Fix For: 3.0.1, 3.1

 Attachments: LUCENE-2196.patch


 As most of the Lucene classes implement Closeable (e.g. IndexWriter), 
 SpellChecker should do so too. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2195:


Attachment: LUCENE-2195.patch

I changed my patch to address Yonik's performance concerns as well as 
Robert's preference to use EMPTY_SET instead of set == null checks. 
I agree with Robert that I would rather have an empty set than null 
assigned if the set is omitted or if the default set is empty. Yet, I 
subclassed UnmodifiableCharArraySet and added a specialized implementation for 
EMPTY_SET that checks for null to throw the NPE and otherwise always returns 
false for all contains(...) methods.
This class is final, and the overhead of the method call should be very small.
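
The specialized empty set described above can be sketched as follows. SimpleCharSet is a stand-in for Lucene's CharArraySet, not the real class; only the shape of the EMPTY_SET specialization is the point here:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the specialized EMPTY_SET idea. SimpleCharSet is a stand-in
// for Lucene's CharArraySet, not the real implementation.
class SimpleCharSet {
    private final Set<String> words = new HashSet<>();

    public void add(String word) { words.add(word); }

    public boolean contains(String word) {
        if (word == null) throw new NullPointerException();
        return words.contains(word);  // hashes the string on every call
    }

    // Specialized empty set: still throws NPE on null (same contract),
    // but skips hashing entirely and always answers false.
    public static final SimpleCharSet EMPTY_SET = new SimpleCharSet() {
        @Override
        public boolean contains(String word) {
            if (word == null) throw new NullPointerException();
            return false;
        }
    };
}
```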

 

 Speedup CharArraySet if set is empty
 

 Key: LUCENE-2195
 URL: https://issues.apache.org/jira/browse/LUCENE-2195
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch


 CharArraySet#contains(...) always computes the hash code of the String, char[] or 
 CharSequence even if the set is empty. 
 contains should return false immediately if the set is empty.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2010-01-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797758#action_12797758
 ] 

Simon Willnauer commented on LUCENE-2094:
-

Hi Yonik,

bq. It looks like it was committed as part of this issue, but I can't find any 
comments here about either the need to make a copy or the need to make an 
unmodifiable set.
Let me help you reconstruct the whole thing a bit. 
UnmodifiableCharArraySet was introduced with LUCENE-1688, as far as I recall, to 
replace the static string array (stopwords) in StopAnalyzer. 
During the refactoring / improvements in contrib/analyzers we decided to make 
analyzers and tokenfilters immutable and use CharArraySet wherever we can. To 
prevent provided sets from being modified while they are in use in a filter, the 
given set is copied and wrapped in an immutable instance of CharArraySet. At 
the same time (still ongoing) we try to convert every set which is likely to be 
used in a TokenFilter into a CharArraySet. WordlistLoader is not done yet but 
on the list; the plan is to change the return values from HashSet into 
Set and create CharArraySet instances internally. 
With LUCENE-2034 we introduced StopwordAnalyzerBase, which also uses the 
UnmodifiableCharArraySet with a copy of the given set.
Copying a CharArraySet is very fast even for large sets, and the creation of 
an UnmodifiableCharArraySet from a CharArraySet instance is basically just an 
object creation. The goal is, again, to prevent any modification to those 
sets while they are in use.
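
The copy-then-wrap pattern described above can be illustrated with plain JDK collections; in Lucene, CharArraySet's copy plus the unmodifiable wrapper plays the same role. The class and names below are illustrative only:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the defensive copy-then-wrap pattern, using JDK collections
// as a stand-in for CharArraySet / UnmodifiableCharArraySet.
class StopwordHolder {
    private final Set<String> stopwords;

    StopwordHolder(Set<String> provided) {
        // Copy first so later caller mutations cannot leak into the filter,
        // then wrap so the filter's own set cannot be modified while in use.
        this.stopwords = Collections.unmodifiableSet(new HashSet<>(provided));
    }

    boolean isStopword(String word) {
        return stopwords.contains(word);
    }
}
```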

bq. This new behavior also no longer matches the javadoc for the constructor. 
I agree we should adjust the javadoc for ctors expecting stopwords to reflect 
the behavior.



 Prepare CharArraySet for Unicode 4.0
 

 Key: LUCENE-2094
 URL: https://issues.apache.org/jira/browse/LUCENE-2094
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.0
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
 LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
 LUCENE-2094.txt, LUCENE-2094.txt


 CharArraySet does lowercasing if created with the corresponding flag. As a 
 result, a String / char[] containing Unicode 4 characters that is in the set 
 cannot be retrieved in ignore-case mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2194) improve efficiency of snowballfilter

2010-01-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797762#action_12797762
 ] 

Simon Willnauer commented on LUCENE-2194:
-

Looks good, Robert. Nice improvement.

 improve efficiency of snowballfilter
 

 Key: LUCENE-2194
 URL: https://issues.apache.org/jira/browse/LUCENE-2194
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2194.patch


 snowball stemming currently creates 2 new strings and 1 new stringbuilder for 
 every word.
 all of this is unnecessary, so don't do it.
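
The allocation-avoiding idea above can be sketched with a reusable buffer: instead of materializing new Strings (and a StringBuilder) per token, keep one growable char[] and overwrite it for each word. This is an illustrative sketch, not the actual patch:

```java
// Sketch of per-token buffer reuse: one growable char[] is overwritten
// for each word instead of allocating new Strings per term.
final class ReusableTermBuffer {
    private char[] buf = new char[16];
    private int len = 0;

    // Copy the term into the internal buffer, growing it only when needed.
    void set(char[] term, int length) {
        if (buf.length < length) {
            buf = new char[Math.max(length, buf.length * 2)];
        }
        System.arraycopy(term, 0, buf, 0, length);
        len = length;
    }

    int length() { return len; }

    char charAt(int i) { return buf[i]; }

    @Override
    public String toString() { return new String(buf, 0, len); }
}
```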

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2010-01-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797772#action_12797772
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see; the problem with this piece of code is that it has the case-insensitive 
flag, which would be ignored if I did not create such a set. As far as 
I can see, even the previous version did not really do what the javadoc says. 
{code}
if (stopWords instanceof CharArraySet) {
  this.stopWords = (CharArraySet) stopWords;
} else {
  this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
  this.stopWords.addAll(stopWords);
}
{code}

I agree we should prevent this costly operation, but it doesn't seem to be easy. 
My first impression is to deprecate the ctors which take the ignoreCase 
boolean and fix the documentation to use CharArraySet if case should be ignored. 
At the same time we should introduce a getter on CharArraySet and only create a 
new set if the given boolean and the ignoreCase member of the CharArraySet do 
not match, provided the set is an instance of CharArraySet.
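
The mismatch check proposed above can be sketched as follows. CaseSet stands in for CharArraySet, and isIgnoreCase() is the hypothetical getter discussed in the comment; neither is real Lucene API:

```java
// Sketch of the proposal: reuse the provided set when its ignoreCase flag
// already matches, copy only on mismatch. CaseSet stands in for
// CharArraySet; isIgnoreCase() is the hypothetical getter.
class CaseSet {
    private final boolean ignoreCase;

    CaseSet(boolean ignoreCase) { this.ignoreCase = ignoreCase; }

    boolean isIgnoreCase() { return ignoreCase; }

    // Stand-in for the cheap CharArraySet.copy, rebuilt with the
    // requested case handling.
    static CaseSet copy(CaseSet other, boolean ignoreCase) {
        return new CaseSet(ignoreCase);
    }
}

class StopFilterSketch {
    static CaseSet makeStopSet(CaseSet stopWords, boolean ignoreCase) {
        if (stopWords.isIgnoreCase() == ignoreCase) {
            return stopWords;                        // flags match: no copy
        }
        return CaseSet.copy(stopWords, ignoreCase);  // copy only on mismatch
    }
}
```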



 Prepare CharArraySet for Unicode 4.0
 

 Key: LUCENE-2094
 URL: https://issues.apache.org/jira/browse/LUCENE-2094
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.0
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
 LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
 LUCENE-2094.txt, LUCENE-2094.txt


 CharArraySet does lowercasing if created with the corresponding flag. As a 
 result, a String / char[] containing Unicode 4 characters that is in the set 
 cannot be retrieved in ignore-case mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2010-01-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797772#action_12797772
 ] 

Simon Willnauer edited comment on LUCENE-2094 at 1/7/10 7:53 PM:
-

bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see; the problem with this piece of code is that it has the case-insensitive 
flag, which would be ignored if I did not create such a set. As far as 
I can see, even the previous version did not really do what the javadoc says. 
{code}
if (stopWords instanceof CharArraySet) {
  this.stopWords = (CharArraySet) stopWords;
} else {
  this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
  this.stopWords.addAll(stopWords);
}
{code}

I agree we should prevent this costly operation, but it doesn't seem to be easy. 
My first impression is to deprecate the ctors which take the ignoreCase 
boolean and fix the documentation to use CharArraySet if case should be ignored. 
At the same time we should introduce a getter on CharArraySet and only create a 
new set if the given boolean and the ignoreCase member of the CharArraySet do 
not match, provided the set is an instance of CharArraySet.

This should also be backported to 2.9 / 3.0 to enable solr to at least fix 
things where possible.



  was (Author: simonw):
bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see; the problem with this piece of code is that it has the case-insensitive 
flag, which would be ignored if I did not create such a set. As far as 
I can see, even the previous version did not really do what the javadoc says. 
{code}
if (stopWords instanceof CharArraySet) {
  this.stopWords = (CharArraySet) stopWords;
} else {
  this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
  this.stopWords.addAll(stopWords);
}
{code}

I agree we should prevent this costly operation, but it doesn't seem to be easy. 
My first impression is to deprecate the ctors which take the ignoreCase 
boolean and fix the documentation to use CharArraySet if case should be ignored. 
At the same time we should introduce a getter on CharArraySet and only create a 
new set if the given boolean and the ignoreCase member of the CharArraySet do 
not match, provided the set is an instance of CharArraySet.


  
 Prepare CharArraySet for Unicode 4.0
 

 Key: LUCENE-2094
 URL: https://issues.apache.org/jira/browse/LUCENE-2094
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.0
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
 LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
 LUCENE-2094.txt, LUCENE-2094.txt


 CharArraySet does lowercasing if created with the corresponding flag. As a 
 result, a String / char[] containing Unicode 4 characters that is in the set 
 cannot be retrieved in ignore-case mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2010-01-07 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797772#action_12797772
 ] 

Simon Willnauer edited comment on LUCENE-2094 at 1/7/10 8:07 PM:
-

bq. Simon, I think yonik refers to this code in stopfilter itself: 
Thank god JIRA lets me edit my comments :)
My X60 was too small to spot the comment about CharArraySet and ignoreCase. 
This is absolutely true - this issue introduced the change, and it should 100% 
use CharArraySet.copy instead of constructing a new CharArraySet.

I will create a new issue and upload a patch.

  was (Author: simonw):
bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see; the problem with this piece of code is that it has the case-insensitive 
flag, which would be ignored if I did not create such a set. As far as 
I can see, even the previous version did not really do what the javadoc says. 
{code}
if (stopWords instanceof CharArraySet) {
  this.stopWords = (CharArraySet) stopWords;
} else {
  this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
  this.stopWords.addAll(stopWords);
}
{code}

I agree we should prevent this costly operation, but it doesn't seem to be easy. 
My first impression is to deprecate the ctors which take the ignoreCase 
boolean and fix the documentation to use CharArraySet if case should be ignored. 
At the same time we should introduce a getter on CharArraySet and only create a 
new set if the given boolean and the ignoreCase member of the CharArraySet do 
not match, provided the set is an instance of CharArraySet.

This should also be backported to 2.9 / 3.0 to enable solr to at least fix 
things where possible.


  
 Prepare CharArraySet for Unicode 4.0
 

 Key: LUCENE-2094
 URL: https://issues.apache.org/jira/browse/LUCENE-2094
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.0
Reporter: Simon Willnauer
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
 LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
 LUCENE-2094.txt, LUCENE-2094.txt


 CharArraySet does lowercasing if created with the corresponding flag. As a 
 result, a String / char[] containing Unicode 4 characters that is in the set 
 cannot be retrieved in ignore-case mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-07 Thread Simon Willnauer (JIRA)
StopFilter should not create a new CharArraySet if the given set is already an 
instance of CharArraySet
---

 Key: LUCENE-2197
 URL: https://issues.apache.org/jira/browse/LUCENE-2197
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Critical
 Fix For: 3.1


With LUCENE-2094 a new CharArraySet is created no matter what type of set is 
passed to StopFilter. This does not behave as documented and could introduce 
serious performance problems. Instead, as the javadoc describes, the 
CharArraySet instance should be passed to CharArraySet.copy (which is very 
fast for CharArraySet instances) rather than copied via new CharArraySet().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-07 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2197:


Attachment: LUCENE-2197.patch

 StopFilter should not create a new CharArraySet if the given set is already 
 an instance of CharArraySet
 ---

 Key: LUCENE-2197
 URL: https://issues.apache.org/jira/browse/LUCENE-2197
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.1
Reporter: Simon Willnauer
Priority: Critical
 Fix For: 3.1

 Attachments: LUCENE-2197.patch


 With LUCENE-2094 a new CharArraySet is created no matter what type of set is 
 passed to StopFilter. This does not behave as documented and could introduce 
 serious performance problems. Instead, as the javadoc describes, the 
 CharArraySet instance should be passed to CharArraySet.copy (which is very 
 fast for CharArraySet instances) rather than copied via new CharArraySet().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2147) Improve Spatial Utility like classes

2010-01-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2147.
-

   Resolution: Fixed
Fix Version/s: 3.1

Committed in revision 896240

Thanks Chris

 Improve Spatial Utility like classes
 

 Key: LUCENE-2147
 URL: https://issues.apache.org/jira/browse/LUCENE-2147
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
Assignee: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, 
 LUCENE-2147.patch, LUCENE-2147.patch


 - DistanceUnits can be improved by giving functionality to the enum, such as 
 being able to convert between different units, and adding tests.  
 - GeoHashUtils can be improved through some code tidying, documentation, and 
 tests.
 - SpatialConstants allows us to move all constants, such as the radii and 
 circumferences of Earth, to a single consistent location that we can then use 
 throughout the contrib.  This also allows us to improve the transparency of 
 calculations done in the contrib, as users of the contrib can easily see the 
 values being used. Currently this issue does not migrate classes to use 
 these constants; that will happen in issues related to the appropriate 
 classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods

2010-01-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796910#action_12796910
 ] 

Simon Willnauer commented on LUCENE-2188:
-

Uwe, 
I'm not sure I have a really good replacement for your names; none of 
the following suggestions seems to be a 100% match though.

for getOverrideDistance() you could call it:
 * getDefinitionDistanceFrom(Class)
 * getImplementationDistanceFrom(Class)

The term distance is fine IMO. I would rather extend the javadoc a little and 
explain that this is the distance between the given class and the next class 
implementing the method on the path from the given class to the base class 
where the method was initially declared / defined.

for isOverriddenBy() you could call it:
 * isDefinedBy()
 * isImplementedBy()

I also want to mention an option for the class name: VirtualMethod pretty much 
matches what this class represents. :) 
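
The "distance" notion described above can be sketched with plain reflection: walk from the given class toward the base class, counting steps until a class that declares the method itself is reached. This is an illustrative sketch, not the actual VirtualMethod code:

```java
// Sketch of the override-distance idea: count superclass hops from the
// given class to the nearest class that declares the method itself.
final class OverrideDistance {
    static int distance(Class<?> clazz, String method, Class<?>... params) {
        int steps = 0;
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod(method, params);
                return steps;   // first class on the path declaring it
            } catch (NoSuchMethodException e) {
                steps++;        // keep climbing toward the base class
            }
        }
        return -1;              // never declared on this path
    }
}

class Base { public String label() { return "base"; } }
class Middle extends Base { }                     // inherits label()
class Leaf extends Middle {
    @Override public String label() { return "leaf"; }
}
```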

 A handy utility class for tracking deprecated overridden methods
 

 Key: LUCENE-2188
 URL: https://issues.apache.org/jira/browse/LUCENE-2188
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, 
 LUCENE-2188.patch


 This issue provides a new handy utility class that keeps track of overridden 
 deprecated methods in non-final sub classes. This class can be used in new 
 deprecations.
 See the javadocs for an example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2010-01-03 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795974#action_12795974
 ] 

Simon Willnauer commented on LUCENE-2183:
-

bq. This issue is blocked by: ...
I give up... 

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 
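
The char-versus-codepoint distinction behind this issue can be sketched as follows: supplementary characters occupy two Java chars but form a single int codepoint, so a codepoint-aware loop advances by Character.charCount(cp) rather than by one char. This is an illustrative sketch, not CharTokenizer itself:

```java
// Sketch of codepoint-based iteration: supplementary characters are two
// Java chars but one int codepoint, so advance by Character.charCount.
final class CodePointWalk {
    static int countLetters(String s) {
        int letters = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);        // full codepoint, not a char
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);     // advance 1 or 2 chars
        }
        return letters;
    }
}
```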

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2010-01-02 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795882#action_12795882
 ] 

Simon Willnauer commented on LUCENE-2034:
-

Robert, I see what you are alluding to. Yet, I agree this is a new issue and 
should be handled separately. The issue would require some changes to the API, 
I guess, or rather additions. Yet, we should commit this regardless! I would be 
happy to make additions to StopwordAnalyzerBase in another issue; as long as we 
haven't released this code we can still change the API, though I don't think we 
have to. #getStopwordSet will always return the set in use, while setting the 
stopword set depending on the version is internal to the class. 



 Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
 -

 Key: LUCENE-2034
 URL: https://issues.apache.org/jira/browse/LUCENE-2034
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
 LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
 LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
 LUCENE-2034.txt


 Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
 need to implement at least one of the methods returning a TokenStream. When 
 you look at the code, the implementations appear to be almost identical if 
 both are implemented in the same analyzer. Each analyzer defines the same 
 inner class (SavedStreams), which is unnecessary.
 In contrib almost every analyzer uses stopwords, and each of them invents its 
 own way of loading them or defines a large number of ctors to load stopwords 
 from a file, set, arrays etc. Those ctors should be deprecated and 
 eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2147) Improve Spatial Utility like classes

2010-01-02 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795892#action_12795892
 ] 

Simon Willnauer commented on LUCENE-2147:
-

{quote}
I'd say that we remove the flux warnings, but instead put a note in the top 
level that since this is a contrib module, it will not adhere to Lucene core's 
strict back compat. policy. 
{quote}
That sounds good; I will put it into a package.html doc and will also add a 
readme to the project itself.

I think this issue is good to go. I will commit it in a few days if nobody 
objects.



 Improve Spatial Utility like classes
 

 Key: LUCENE-2147
 URL: https://issues.apache.org/jira/browse/LUCENE-2147
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
Assignee: Simon Willnauer
 Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, 
 LUCENE-2147.patch, LUCENE-2147.patch


 - DistanceUnits can be improved by giving functionality to the enum, such as 
 being able to convert between different units, and adding tests.  
 - GeoHashUtils can be improved through some code tidying, documentation, and 
 tests.
 - SpatialConstants allows us to move all constants, such as the radii and 
 circumferences of Earth, to a single consistent location that we can then use 
 throughout the contrib.  This also allows us to improve the transparency of 
 calculations done in the contrib, as users of the contrib can easily see the 
 values being used. Currently this issue does not migrate classes to use 
 these constants; that will happen in issues related to the appropriate 
 classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2147) Improve Spatial Utility like classes

2009-12-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795277#action_12795277
 ] 

Simon Willnauer commented on LUCENE-2147:
-

Since this is the first issue which comes close to being committed, some 
questions arise from my side: should we mark the new API as experimental, like 
the function API in o.a.l.s.function? I think it would make sense to keep a 
warning that contrib/spatial might slightly change in the future.
On the other hand, we should try to put more confidence into contrib/spatial 
for more user acceptance. I currently work with customers that moved away from 
spatial due to its early stage and flux warnings, which is quite 
understandable. I would like to hear other opinions regarding this topic 
- especially opinions of more experienced committers would be appreciated.

 Improve Spatial Utility like classes
 

 Key: LUCENE-2147
 URL: https://issues.apache.org/jira/browse/LUCENE-2147
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
Assignee: Simon Willnauer
 Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, 
 LUCENE-2147.patch, LUCENE-2147.patch


 - DistanceUnits can be improved by giving functionality to the enum, such as 
 being able to convert between different units, and adding tests.  
 - GeoHashUtils can be improved through some code tidying, documentation, and 
 tests.
 - SpatialConstants allows us to move all constants, such as the radii and 
 circumferences of Earth, to a single consistent location that we can then use 
 throughout the contrib.  This also allows us to improve the transparency of 
 calculations done in the contrib, as users of the contrib can easily see the 
 values being used. Currently this issue does not migrate classes to use 
 these constants; that will happen in issues related to the appropriate 
 classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2173) Simplify and tidy Cartesian Tier Code in Spatial

2009-12-30 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2173:
---

Assignee: Simon Willnauer

 Simplify and tidy Cartesian Tier Code in Spatial
 

 Key: LUCENE-2173
 URL: https://issues.apache.org/jira/browse/LUCENE-2173
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
Assignee: Simon Willnauer
 Attachments: LUCENE-2173.patch, LUCENE-2173.patch, LUCENE-2173.patch


 The Cartesian Tier filtering code in the spatial code can be simplified, 
 tidied and generally improved.  Improvements include removing default field 
 name support which isn't the responsibility of the code, adding javadoc, 
 making method names more intuitive and trying to make the complex code in 
 CartesianPolyFilterBuilder more understandable.
 A few deprecations have to occur as part of this work, but some public methods 
 in CartesianPolyFilterBuilder will be made private where possible so future 
 improvements of this class can occur.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2174) Add new SpatialFilter and DistanceFieldComparatorSource to Spatial

2009-12-30 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2174:
---

Assignee: Simon Willnauer

 Add new SpatialFilter and DistanceFieldComparatorSource to Spatial
 --

 Key: LUCENE-2174
 URL: https://issues.apache.org/jira/browse/LUCENE-2174
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
Assignee: Simon Willnauer
 Attachments: LUCENE-2174.patch


 The current DistanceQueryBuilder and DistanceFieldComparatorSource in Spatial 
 are based on the old filtering process, most of which has been deprecated in 
 previous issues.  These will be replaced by a new SpatialFilter class, which 
 is a proper Lucene filter, and a new DistanceFieldComparatorSource which will 
 be relocated and will use the new DistanceFilter interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2152) Abstract Spatial distance filtering process and supported field formats

2009-12-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795282#action_12795282
 ] 

Simon Willnauer commented on LUCENE-2152:
-

Chris, indeed this is a tricky one. One problem that arises with the map used 
for distance caching is when you want to use spatial with a filter and sort in 
contrib/remote. At least in the current code (not your patch - I haven't looked 
at it yet) the sort instance is obtained from the filter and depends on the map 
instance filled by the filter. After serialization the map instance disappears 
and sorting no longer works on the remote side. If we could decouple the 
distance storage from the filter implementation, we could also come up with a 
solution for the sorting problem, like providing a RemoteCollector with an 
internal key/value lookup function that the sort function can use to look up 
the calculated values.

I personally would go one step further and introduce an exchangeable distance 
calculation function as a first step and a collector as a second. It would even 
be possible to introduce a delegation approach, as in the following example:
{code}
DistanceFunction func = new MapCachingDistFunc(
    new DefaultDistanceFunc(new CustomFieldDecoder()));

for (int docId : docs) {
  if (func.distance(docId, reader, point) <= dist) {
    bitSet.set(docId);
  }
}
{code}

That way we could completely separate the problem into a function interface / 
abstract class and could provide several implementations. It would also be 
possible to solve our sorting problem by passing a special 
RemoteDistanceFunction to both the sort and filter implementations. I don't 
know what that would look like in the implementation, though. 
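To make the delegation idea concrete, here is a minimal sketch. All names are hypothetical - none of these types exist in contrib/spatial - and the reader/point arguments are folded into the function's state for brevity:

```java
// Hypothetical sketch only: DistanceFunction, DefaultDistanceFunc and
// MapCachingDistFunc are illustrative names, not contrib/spatial API.
import java.util.HashMap;
import java.util.Map;

interface DistanceFunction {
    double distance(int docId);
}

// Stand-in for a real implementation that would decode a field and
// apply a great-circle formula.
class DefaultDistanceFunc implements DistanceFunction {
    public double distance(int docId) {
        return docId * 0.5; // placeholder computation
    }
}

// Caching decorator: callers never know whether a value was cached,
// so Filter, Sort and CustomScoreQuery need no caching logic of their own.
class MapCachingDistFunc implements DistanceFunction {
    private final DistanceFunction delegate;
    private final Map<Integer, Double> cache = new HashMap<>();

    MapCachingDistFunc(DistanceFunction delegate) {
        this.delegate = delegate;
    }

    public double distance(int docId) {
        return cache.computeIfAbsent(docId, delegate::distance);
    }
}
```

The point of the decorator shape is that the same wrapped instance can be handed to both the filter and the sort, which is exactly what a RemoteDistanceFunction would need.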

Maybe we can even use this function interface in the CustomScoreQuery as well. 

Just some random ideas

 Abstract Spatial distance filtering process and supported field formats
 ---

 Key: LUCENE-2152
 URL: https://issues.apache.org/jira/browse/LUCENE-2152
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
 Attachments: LUCENE-2152.patch, LUCENE-2152.patch, LUCENE-2152.patch


 Currently the second stage of the filtering process in the spatial contrib 
 involves calculating the exact distance for the remaining documents, and 
 filtering out those that fall out of the search radius.  Currently this is 
 done through the 2 impls of DistanceFilter, LatLngDistanceFilter and 
 GeoHashDistanceFilter.  The main difference between these 2 impls is the 
 format of data they support, the former supporting lat/lngs being stored in 2 
 distinct fields, while the latter supports geohashed lat/lngs through the 
 GeoHashUtils.  This difference should be abstracted out so that the distance 
 filtering process is data format agnostic.
 The second issue is that the distance filtering algorithm can be considerably 
 optimized by using multiple-threads.  Therefore it makes sense to have an 
 abstraction of DistanceFilter which has different implementations, one being 
 a multi-threaded implementation and the other being a blank implementation 
 that can be used when no distance filtering is to occur.




[jira] Commented: (LUCENE-2152) Abstract Spatial distance filtering process and supported field formats

2009-12-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795295#action_12795295
 ] 

Simon Willnauer commented on LUCENE-2152:
-

bq. Given that, I'm still sort of favouring separating the distance calculation 
function from the storage mechanism.

The actual reasons why I proposed it that way are kind of special. Imagine you 
do a search 1 mile around point X, and the next search is 2 miles around point 
X. For such a case you could simply wrap the function in another cache 
function, using the already existing cache as a second-level cache. All the 
logic for that would be encapsulated in a simple function; none of it would be 
necessary in any of the implementations like CustomScoreQuery, Sort or Filter. 
Yet if you separate them into two interfaces (not necessarily Java interfaces) 
you would have to have some logic which checks whether the value is already 
cached somewhere. 
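For illustration (hypothetical names, not contrib code), the second-level cache falls out of simply stacking two caching decorators around a base function:

```java
// Hypothetical sketch: a caching decorator over a distance function.
// Stacking two of them gives a per-search first-level cache in front of
// a longer-lived second-level cache, with no logic in Filter/Sort code.
import java.util.HashMap;
import java.util.Map;

interface DistanceFunction {
    double distance(int docId);
}

class MapCachingDistFunc implements DistanceFunction {
    private final DistanceFunction delegate;
    private final Map<Integer, Double> cache = new HashMap<>();
    int misses; // exposed only for this sketch: counts delegate calls

    MapCachingDistFunc(DistanceFunction delegate) {
        this.delegate = delegate;
    }

    public double distance(int docId) {
        return cache.computeIfAbsent(docId, id -> {
            misses++;
            return delegate.distance(id);
        });
    }
}

public class TwoLevelCacheSketch {
    public static void main(String[] args) {
        // level 2: cache that survives across searches around the same point
        MapCachingDistFunc acrossSearches =
            new MapCachingDistFunc(docId -> docId * 1.0);
        // level 1: per-search cache wrapping the existing one
        MapCachingDistFunc perSearch = new MapCachingDistFunc(acrossSearches);

        perSearch.distance(7); // computed once, fills both caches
        perSearch.distance(7); // served by level 1; level 2 is not touched
        System.out.println(acrossSearches.misses); // prints 1
    }
}
```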
I'm not bound to this solution - just throwing in random thoughts which could 
be useful for users to some extent. For me a distance is just a function, and I 
don't care whether it is cached or not. The logic which takes care of caching 
should be completely transparent IMO. If possible we should avoid calls inside 
the filter etc. like:

{code}
if (cached)
  getFromCache();
else
  getFromFunc();
{code}

 Abstract Spatial distance filtering process and supported field formats
 ---

 Key: LUCENE-2152
 URL: https://issues.apache.org/jira/browse/LUCENE-2152
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
 Attachments: LUCENE-2152.patch, LUCENE-2152.patch, LUCENE-2152.patch


 Currently the second stage of the filtering process in the spatial contrib 
 involves calculating the exact distance for the remaining documents, and 
 filtering out those that fall out of the search radius.  Currently this is 
 done through the 2 impls of DistanceFilter, LatLngDistanceFilter and 
 GeoHashDistanceFilter.  The main difference between these 2 impls is the 
 format of data they support, the former supporting lat/lngs being stored in 2 
 distinct fields, while the latter supports geohashed lat/lngs through the 
 GeoHashUtils.  This difference should be abstracted out so that the distance 
 filtering process is data format agnostic.
 The second issue is that the distance filtering algorithm can be considerably 
 optimized by using multiple-threads.  Therefore it makes sense to have an 
 abstraction of DistanceFilter which has different implementations, one being 
 a multi-threaded implementation and the other being a blank implementation 
 that can be used when no distance filtering is to occur.




[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2009-12-29 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795043#action_12795043
 ] 

Simon Willnauer commented on LUCENE-2183:
-

Hey guys, thanks for your comments.
When I started thinking about this issue I had a quick chat with Robert, and we 
figured that his solution could work, so I implemented it.
Yet I found two problems with it.
1. If a user calls super.isTokenChar(char) and the superclass has implemented 
the int method, the UOE will never be thrown and the code does not behave as 
the user expects. This is what Robert explained above. We could solve this 
problem with reflection, which leads to the second problem.

2. If a Tokenizer like LowerCaseTokenizer only overrides normalize(char|int), 
it relies on the superclass implementation of isTokenChar. Yet if we solve 
problem 1, the user would be forced to override isTokenChar just to call 
super.isTokenChar; otherwise the reflection code will raise an exception that 
the int method is not implemented in the concrete class, or it will use the 
char API - either way it will not do what is expected. 

Working around those two problems is what motivated a new API for 
CharTokenizer. My personal opinion is that inheritance is the wrong tool for 
changing behavior, so I used delegation (like a strategy) to, on the one hand, 
define a clear new API and, on the other, decouple the code changing the 
behavior of the Tokenizer from the tokenizer itself. Inheritance, for me, is 
for extending a class, and delegation is for changing behavior in this 
particular problem. 
Decoupling the old from the new has several advantages over a reflection / 
inheritance based solution:
1. If a user does not provide a delegation impl, he wants to use the old API.
2. If a user does provide a delegation impl, he still has the ability to choose 
between char processing in 3.0 style or 3.1 style.
3. No matter what is provided, a user has full flexibility to choose the 
combination of their choice - old char processing with the new int based API 
(maybe minor, though).
4. We can leave all tokenizer subclasses as they are and define new functions 
that implement their behavior in parallel. Those functions can be made final 
from the beginning, which prevents users from subclassing them. (All of the 
existing ones should be final in my opinion - like LowerCaseTokenizer, which 
should call Character.isLetter in isTokenCodePoint(int) directly instead of 
subclassing another function.)

As a user I would expect Lucene to revise design decisions made years ago when 
there is a need for it, like we have in this issue. It is easier to change 
behavior in user code by swapping to a new API than by digging into a 
workaround implementation of an old API silently calling a new API.
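A minimal sketch of the delegation idea described above (all names are hypothetical and may differ from the patch): the tokenizer would own a processor object instead of relying on subclass overrides, and processing happens on full code points.

```java
// Hypothetical strategy sketch: behavior lives in a delegate, not in a
// CharTokenizer subclass. Names are illustrative, not the patch's API.
interface CharProcessor {
    boolean isTokenCodePoint(int codePoint);
    int normalizeCodePoint(int codePoint);
}

// Letter-based processing on code points, so supplementary characters
// (e.g. Deseret) work without any surrogate handling in user code.
final class LetterProcessor implements CharProcessor {
    public boolean isTokenCodePoint(int cp) {
        return Character.isLetter(cp);
    }
    public int normalizeCodePoint(int cp) {
        return Character.toLowerCase(cp);
    }
}

public class StrategySketch {
    public static void main(String[] args) {
        CharProcessor p = new LetterProcessor();
        int deseret = 0x10400; // DESERET CAPITAL LETTER LONG I
        System.out.println(p.isTokenCodePoint(deseret));
        System.out.println(Integer.toHexString(p.normalizeCodePoint(deseret)));
    }
}
```

Making LetterProcessor final is exactly the point raised in item 4: behavior is swapped by providing a different CharProcessor, not by subclassing.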



 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 




[jira] Commented: (LUCENE-2183) Supplementary Character Handling in CharTokenizer

2009-12-29 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795091#action_12795091
 ] 

Simon Willnauer commented on LUCENE-2183:
-

{quote}
#2 is no problem at all, instead the reflection code to address #1 must be 
implemented with these conditions

* A is the class implementing method isTokenChar(int)
* B is the class implementing method isTokenChar(char)
* B is a subclass of A
* A is not CharTokenizer
{quote}

ok here is a scenario:
{code}
class MySmartDeseretTokenizer extends LetterTokenizer {

  public boolean isTokenChar(char c) {
    // we trust that Deseret high/low surrogates are never unpaired
    return super.isTokenChar(c) || isDeseretHighLowSurrogate(c);
  }

  public char normalize(char c) {
    if (isDeseretHighSurrogate(c))
      return c;
    if (isDeseretLowSurrogate(c))
      return lowerCaseDeseret('\ud801', c)[1];
    return Character.toLowerCase(c);
  }

  public int normalize(int c) {
    return Character.toLowerCase(c);
  }
}
{code}

If somebody has code similar to this, they might want to preserve 
compatibility because they have different versions of their app. The old app 
only supports Deseret high surrogates, but the new one accepts all letter 
supplementary chars due to super.isTokenChar(int). This scenario will break 
our reflection solution, and users might be disappointed, as the new API is 
there to bring the Unicode support. I don't say this scenario exists, but it 
could be a valid one for a very special use case. 

I don't say my proposal is THE way to go, but I really don't want to use 
reflection - that would make things worse IMO. 
Let's find a solution that fits all scenarios.

bq. in the design you propose under the new api, subclassing is impossible, 
which I am not sure I like either.

Hmm, that is not true. You can still subclass and pass your impl up to the 
superclass. I haven't implemented that yet, but it is definitely possible.
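For instance (stub types, purely illustrative - not the patch's API), subclassing can coexist with delegation by handing the processor to the super constructor:

```java
// Stub superclass standing in for the proposed delegating CharTokenizer;
// all names here are hypothetical.
interface CharProcessor {
    boolean isTokenCodePoint(int codePoint);
}

abstract class DelegatingTokenizer {
    protected final CharProcessor processor;

    protected DelegatingTokenizer(CharProcessor processor) {
        this.processor = processor;
    }

    boolean accepts(int cp) {
        return processor.isTokenCodePoint(cp);
    }
}

// A subclass still works: it simply pushes its behavior up as a delegate.
class MyLetterTokenizer extends DelegatingTokenizer {
    MyLetterTokenizer() {
        super(Character::isLetter);
    }
}
```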

 Supplementary Character Handling in CharTokenizer
 -

 Key: LUCENE-2183
 URL: https://issues.apache.org/jira/browse/LUCENE-2183
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Simon Willnauer
 Fix For: 3.1

 Attachments: LUCENE-2183.patch


 CharTokenizer is an abstract base class for all Tokenizers operating on a 
 character level. Yet, those tokenizers still use char primitives instead of 
 int codepoints. CharTokenizer should operate on codepoints and preserve bw 
 compatibility. 




[jira] Updated: (LUCENE-2147) Improve Spatial Utility like classes

2009-12-28 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2147:


Attachment: LUCENE-2147.patch

Chris, this seems ready to be committed soon. I removed the in-flux warnings 
in the class JavaDocs, converted the tests to JUnit 4 and added a CHANGES.txt 
entry to make it ready to commit.


 Improve Spatial Utility like classes
 

 Key: LUCENE-2147
 URL: https://issues.apache.org/jira/browse/LUCENE-2147
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 3.1
Reporter: Chris Male
Assignee: Simon Willnauer
 Attachments: LUCENE-2147.patch, LUCENE-2147.patch, LUCENE-2147.patch, 
 LUCENE-2147.patch, LUCENE-2147.patch


 - DistanceUnits can be improved by giving functionality to the enum, such as 
 being able to convert between different units, and adding tests.  
 - GeoHashUtils can be improved through some code tidying, documentation, and 
 tests.
 - SpatialConstants allows us to move all constants, such as the radii and 
 circumferences of Earth, to a single consistent location that we can then use 
 throughout the contrib.  This also allows us to improve the transparency of 
 calculations done in the contrib, as users of the contrib can easily see the 
 values being used.  Currently this issues does not migrate classes to use 
 these constants, that will happen in issues related to the appropriate 
 classes.



