RE: latest lucene update
Did you also test that the speed went back to normal with the latest fix in trunk (without modifying Solr code)? I ran the Solr tests with an updated lucene-core-2.9.jar here, but I was not able to find out which of the tests had the big slowdown. I only noticed some speedup in a few tests related to search.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, July 16, 2009 2:57 AM
To: java-dev@lucene.apache.org
Subject: Re: latest lucene update

Thanks guys, I had actually meant this message to go to solr-dev... hence the "but I think we should implement the new methods anyway." I've implemented them, and the performance has returned to normal.

-Yonik
http://www.lucidimagination.com

On Wed, Jul 15, 2009 at 4:00 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> Running the Solr unit tests seems a fair bit slower now. I think the root cause may be this:
> http://search.lucidimagination.com/search/document/a8bd12c3b87e98a3/speed_of_booleanqueries_on_2_9
> That may be fixed, but I think we should implement the new methods anyway.
> I'm also surprised that more changes weren't necessary to get the latest Lucene to work... one thing in particular is docs out of order - Solr currently requires them in order to correctly create DocSet instances, and I'm not sure this is the case any more. I'll look into it.
> -Yonik
> http://www.lucidimagination.com
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1693:
----------------------------------
Attachment: lucene-1693.patch

This is basically your last patch with these changes:

- I removed AttributeSource.setAttributeFactory(factory). Since we now have the constructor that takes the factory as an argument, there should be no need to ever change the factory after a TokenStream has been created. It would also lead to problems regarding e.g. Tee/Sink: a user could add attributes to the Tee, then change the factory, then create the sink. How could we then create the same attribute impls for the sink? So I think the right thing to do is to not allow changing the factory after the stream is instantiated.

- I added an initial (untested) version of TeeSinkTokenFilter to demonstrate how I think it should work now. I'll finish it tomorrow or Friday (add more javadocs and a unit test). I'll also add the CachingAttributeTokenFilter, which is essentially almost the same as the new inner class of TeeSinkTokenFilter. When I have CATF, the inner class can probably just extend it.

AttributeSource/TokenStream API improvements
--------------------------------------------
Key: LUCENE-1693
URL: https://issues.apache.org/jira/browse/LUCENE-1693
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java

This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- removes the set/getUseNewAPI() methods (including the standard ones). Instead, by default incrementToken() throws a subclass of UnsupportedOperationException. The indexer tries to call incrementToken() once initially to see if the exception is thrown; if so, it falls back to the old API.

- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.

- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.

- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.

- Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes, or implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think addAttributeImpl should be considered an expert API, as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for
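As context for the API changes described above, here is a minimal sketch of what a consumer of the attribute-based TokenStream API looks like against the Lucene 2.9 interfaces. It is illustrative only, not code from the patch:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    class AttributeApiSketch {
      // The consumer registers the attributes it cares about once, then calls
      // incrementToken() instead of next(Token); the attribute instance is
      // updated in place on each call.
      static void consume(TokenStream stream) throws IOException {
        TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
          System.out.println(termAtt.term());
        }
      }
    }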
[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-1566:
------------------------------------
Attachment: LUCENE_1566_IndexInput_Changes.patch

* Set chunkSize to Integer.MAX_VALUE on 64-bit JVMs
* Removed the 64-bit JVM condition, as chunkSize is set to the maximum in the 64-bit case
* Added CHANGES.txt to the patch

@Mike: once you commit this change I will close this issue.

Simon

Large Lucene index can hit false OOM due to Sun JRE issue
----------------------------------------------------------
Key: LUCENE-1566
URL: https://issues.apache.org/jira/browse/LUCENE-1566
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch

This is not a Lucene issue, but I want to open it so future Google diggers can more easily find it. There's a nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546

The gist seems to be that if you try to read a large number of bytes (e.g. 200 MB) during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this with norms, since we read one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html
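For readers who land on this issue via search, the workaround amounts to splitting one very large read into bounded chunks. A rough sketch of the idea follows; this is not the actual Lucene patch (which lives in the FSDirectory IndexInput code), and the chunk size shown is only illustrative:

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    class ChunkedReadSketch {
      // Illustrative value; per the comment above, the patch sets chunkSize to
      // Integer.MAX_VALUE on 64-bit JVMs and a smaller value on 32-bit JVMs.
      private static final int CHUNK_SIZE = 100 * 1024 * 1024;

      // Fill b[offset..offset+len) using several smaller reads instead of one
      // huge RandomAccessFile.read call, which can trigger Sun bug 6478546.
      static void readFully(RandomAccessFile file, byte[] b, int offset, int len) throws IOException {
        int total = 0;
        while (total < len) {
          int toRead = Math.min(CHUNK_SIZE, len - total);
          int got = file.read(b, offset + total, toRead);
          if (got < 0) {
            throw new EOFException("read past EOF");
          }
          total += got;
        }
      }
    }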
[jira] Created: (LUCENE-1747) Contrib/Spatial needs code cleanup before release
Contrib/Spatial needs code cleanup before release
--------------------------------------------------
Key: LUCENE-1747
URL: https://issues.apache.org/jira/browse/LUCENE-1747
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/spatial
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
Fix For: 2.9

I had a brief look at the spatial sources and found quite a few warnings, main methods, loggers, immutable classes without final members, unused variables, unused methods, etc. Once Mike has committed https://issues.apache.org/jira/browse/LUCENE-1505 I will start cleaning this up a bit. It seems there are not many unit tests in this project either; I might open an issue for 3.0 / 3.1 for that later.
[jira] Updated: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adriano Crestani updated LUCENE-1567:
-------------------------------------
Attachment: lucene_trunk_FlexQueryParser_2009july16_v7.patch

Here are some updates for the new query parser:

- support for setting the minimum fuzzy similarity was added to the configuration handler
- get methods were added to the configuration handler, so users accustomed to the old query parser can easily access the configuration in the old way
- renamed everything referencing "lucene2" to "original"
- removed one author tag
- improved javadoc documentation
- added a constructor to LuceneQueryParserHelper that accepts an Analyzer as an argument; I think Lucene users are used to creating a query parser and also passing the analyzer

That's it :)

I have also noticed that when building with "ant build-contrib" the .properties files are not copied into the jar. The new query parser reads its NLS messages from a property file, and I'm getting some message warnings when running the tests. Is anybody getting the same warnings?

New flexible query parser
--------------------------
Key: LUCENE-1567
URL: https://issues.apache.org/jira/browse/LUCENE-1567
Project: Lucene - Java
Issue Type: New Feature
Components: QueryParser
Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
Fix For: 2.9
Attachments: lucene_1567_adriano_crestani_07_13_2009.patch, lucene_trunk_FlexQueryParser_2009July09_v4.patch, lucene_trunk_FlexQueryParser_2009July10_v5.patch, lucene_trunk_FlexQueryParser_2009july15_v6.patch, lucene_trunk_FlexQueryParser_2009july16_v7.patch, lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch, new_query_parser_src.tar, QueryParser_restructure_meetup_june2009_v2.pdf

From the "New flexible query parser" thread by Michael Busch:

In my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time refactoring the code and designing a very generic architecture, so that this query parser can easily be used for different products with varying query syntaxes. This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions as they're much more familiar with the code than I am.

The goal was to separate the syntax and semantics of a query. E.g. 'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch.

The query parser has three layers and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

      AND
     /   \
    A     B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer, which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc.

2. The query node processors do most of the work. This layer is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed or to tokenize terms.

3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects.

Furthermore the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow attaching resource bundles. This makes it possible to translate messages, which is an important feature of a query parser. This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in
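To make the layering a bit more concrete, here is a deliberately simplified, hypothetical sketch of a node tree for 'a AND b' and a builder that turns it into a Lucene Query. None of these class names come from the contributed parser (which also has the processor layer in between); only Term, TermQuery, BooleanQuery and BooleanClause are real Lucene classes:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Layer 1 (parsing) would turn 'a AND b', '+a +b' or 'AND(a,b)' into the
    // same tree: new AndNode(new TermNode("a"), new TermNode("b")).
    abstract class Node {}
    final class TermNode extends Node { final String text; TermNode(String t) { text = t; } }
    final class AndNode extends Node { final Node left, right; AndNode(Node l, Node r) { left = l; right = r; } }

    // Layer 3 (building) walks the (already processed) tree and creates Query objects.
    final class SimpleQueryBuilder {
      Query build(Node node, String field) {
        if (node instanceof TermNode) {
          return new TermQuery(new Term(field, ((TermNode) node).text));
        }
        AndNode and = (AndNode) node;
        BooleanQuery bq = new BooleanQuery();
        bq.add(build(and.left, field), BooleanClause.Occur.MUST);
        bq.add(build(and.right, field), BooleanClause.Occur.MUST);
        return bq;
      }
    }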
[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731880#action_12731880 ]

Michael McCandless commented on LUCENE-1566:
--------------------------------------------

SimpleFSDirectory is missing from the last patch?
[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731886#action_12731886 ]

Simon Willnauer commented on LUCENE-1566:
-----------------------------------------

bq. SimpleFSDirectory is missing from the last patch?

Oops! :)
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731893#action_12731893 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

OK, looks good. I think you will go to bed now, so our work will not collide. If you start programming again, let me know, and I will post a patch first (which makes merging simpler). TortoiseSVN has a problem with merging added files, so when applying your patch I have to remove them first :-(

Some comments:

- TeeSinkTokenFilter looks good. I think we should also add a test for it (in principle the version of TestTeeTokenFilter from current trunk, not the one reverted to the old API in the current patch).
- I do not completely understand why this WeakReference is needed between Tee and Sink. If it is needed, the code may fail with an NPE when Reference.get() returns null. The idea is that one can create a Sink for the Tee and throw the Sink away, and the Tee would then simply stop passing the attributes to the sink? If that is the case, the check for Reference.get() == null is really missing.
- Should I implement CachingAttributesFilter as a replacement for CachingTokenFilter, or will you do it together with TeeSink?

I will now start to add all the finals to the missing core analyzers.

bq. The only small performance improvement we should probably make is to avoid checking which method in TokenStream is overridden when onlyUseNewAPI==true

I could disable this for next() and next(Token). In the case of incrementToken(), it should really check that it is implemented, because not doing so would fail hard or create endless loops. So that check should be there in all cases. But if onlyUseNewAPI is enabled, I could simply define hasNext and hasReusableNext = false. I will do this.
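Regarding the WeakReference question above, the missing null check would look roughly like this. It is a hypothetical sketch, not code from the patch; only AttributeSource, its State class, and restoreState come from Lucene, and the method and field names are made up:

    import java.lang.ref.WeakReference;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.util.AttributeSource;

    // If the Tee holds its sinks only via WeakReference, every use of
    // Reference.get() must tolerate a sink that was garbage collected.
    class TeeSketch {
      private final List<WeakReference<AttributeSource>> sinks =
          new LinkedList<WeakReference<AttributeSource>>();

      void passStateToSinks(AttributeSource.State state) {
        Iterator<WeakReference<AttributeSource>> it = sinks.iterator();
        while (it.hasNext()) {
          AttributeSource sink = it.next().get();
          if (sink == null) {
            it.remove();            // the user threw the sink away; stop feeding it
            continue;
          }
          sink.restoreState(state); // stand-in for however the sink buffers states
        }
      }
    }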
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731896#action_12731896 ]

Grant Ingersoll commented on LUCENE-1693:
-----------------------------------------

Favor to ask: when this is ready to commit, can you give a few days' notice so that the rest of us can look at it before committing? I've been keeping up with the comments, but not the patches.
Re: Search in non-linguistic text
Ack... Clicked on the wrong group. Sorry - I'll move it.
[jira] Created: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans should be abstract
getPayloadSpans on org.apache.lucene.search.spans should be abstract
---------------------------------------------------------------------
Key: LUCENE-1748
URL: https://issues.apache.org/jira/browse/LUCENE-1748
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 2.4.1, 2.4
Environment: all
Reporter: Hugh Cayless
Fix For: 2.4.2

I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans, which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks!
[jira] Updated: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hugh Cayless updated LUCENE-1748:
---------------------------------
Summary: getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract (was: getPayloadSpans on org.apache.lucene.search.spans should be abstract)
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731939#action_12731939 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. Shouldn't it throw a runtime exception (unsupported operation?) or something?

What is the difference between adding an abstract method and adding a method that throws an exception, with regard to jar drop-in back compat? In both cases, when you drop your new jar in you get an exception; it's just that in the latter case the exception is deferred.
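For readers following the thread, here is a schematic contrast of the two options being weighed, assuming Lucene 2.4's Spans/PayloadSpans classes; this is not the real SpanQuery source, just an illustration of the trade-off:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.PayloadSpans;
    import org.apache.lucene.search.spans.Spans;

    public abstract class SpanQuerySketch extends Query {
      public abstract Spans getSpans(IndexReader reader) throws IOException;

      // Option A (this issue's request): declare it abstract so that a missing
      // implementation fails at compile time.
      // public abstract PayloadSpans getPayloadSpans(IndexReader reader) throws IOException;

      // Option B (the runtime alternative): keep a concrete default, but fail
      // loudly instead of returning null and causing a NullPointerException
      // far away from the real problem.
      public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
        throw new UnsupportedOperationException(
            getClass().getName() + " does not implement getPayloadSpans()");
      }
    }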
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731940#action_12731940 ]

Hugh Cayless commented on LUCENE-1748:
--------------------------------------

Ah. I figured it would be something like that. Yes, if abstract isn't possible, an UnsupportedOperationException would at least get closer to the source of the problem.
[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731940#action_12731940 ]

Hugh Cayless edited comment on LUCENE-1748 at 7/16/09 6:43 AM:
---------------------------------------------------------------

Ah. I figured it would be something like that. Yes, if abstract isn't possible, an UnsupportedOperationException would at least get closer to the source of the problem.

From my perspective at least, backwards compatibility is already broken, since Lucene doesn't work with SpanQuerys that don't implement getPayloadSpans, but I understand y'all will have different requirements in this regard.
Re: [jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> bq. Shouldn't it throw a runtime exception (unsupported operation?) or something?
>
> What is the difference between adding an abstract method and adding a method that throws an exception, with regard to jar drop-in back compat? In both cases, when you drop your new jar in you get an exception; it's just that in the latter case the exception is deferred.

Yeah, it's dicey - I suppose the idea is that, if you used the code as you used to, it wouldn't try to call getPayloadSpans? And so if you kept using the non-payload-spans functionality you would be set, and if you tried to use payload spans you would get an exception saying the class needed to be updated.

But if you make it abstract, we lose jar drop-in (I know I've read we don't have it for this release anyway) and everyone has to implement the method. At least with the exception, if you are using the class as you used to, you can continue to do so with no work? Not that I've considered it for very long at the moment.

I know, I see your point - this back-compat stuff is always dicey - that's why I throw it out there with a question mark - hopefully others will continue to chime in.

--
- Mark
http://www.lucidimagination.com
Re: latest lucene update
On Thu, Jul 16, 2009 at 2:11 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> Did you also test that the speed went back to normal with the latest fix in trunk (without modifying Solr code)?

I didn't - I was already part way through implementing advance() in Solr. I'm sure the advance() fix in Lucene would have worked too, though.

-Yonik
http://www.lucidimagination.com
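For readers not following the Lucene 2.9 API change: "implementing the new methods" means providing nextDoc()/advance() directly on the DocIdSetIterator rather than relying on the deprecated next()/skipTo() bridge. A hedged sketch of what such an iterator over a sorted int[] (roughly the shape of a sorted DocSet slice) might look like; this is not Solr's actual patch:

    import org.apache.lucene.search.DocIdSetIterator;

    // The thread above is about BooleanQuery conjunctions being slow when these
    // methods are not implemented directly, because conjunction scoring calls
    // advance() heavily.
    class SortedIntDocIdSetIterator extends DocIdSetIterator {
      private final int[] docs;   // sorted, distinct doc ids
      private int idx = -1;
      private int doc = -1;

      SortedIntDocIdSetIterator(int[] docs) { this.docs = docs; }

      public int docID() { return doc; }

      public int nextDoc() {
        idx++;
        return doc = (idx < docs.length) ? docs[idx] : NO_MORE_DOCS;
      }

      public int advance(int target) {
        // a linear skip is enough for a sketch; a real implementation could use
        // binary search over the remaining range
        do {
          nextDoc();
        } while (doc < target);
        return doc;
      }
    }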
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731947#action_12731947 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

I forgot: I also implemented the final next() methods in all non-final classes.
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------
Attachment: LUCENE-1693.patch

New patch with some more work. First the fantastic news: as CachingTokenFilter has no API to access the cached attributes/tokens directly, it does not need to be deprecated; it just switches the internal and hidden implementation to incrementToken() and attributes. I also added an additional test to the BW test case that checks whether the caching also works for your strange POSTokens. And it works! You can even mix the consumers, e.g. first use the new API to cache tokens and then replay using the old API. Really cool.

The reason the POSToken was not preserved in the past was an error in TokenWrapper.copyTo(). This method created a new Token and copied the contents into it using reinit(). Now it simply creates a clone and lets delegate point to it (this is how the caching worked before). In principle Tee/SinkTokenizer could also work like this; the only problem with that class is that it has a public API that exposes the Token instances to the outside. Because of that, there is no way around deprecating it.

Your new TeeSinkTokenFilter looks good, it only had one problem: it used addAttributeImpl to add the attribute of the Tee to the newly created Sink. Because of this, the sink got the same instance the parent added. With useOnlyNewAPI, this does not have an effect for the standard attributes, as the ctor already created a Token instance as implementation and added it to the stream, so addAttributeImpl had no effect. I changed this to use the getAttributeClassesIterator and added a new attribute instance for each attribute to the sink using addAttribute. As the factory is the same, the attributes are generated in the same way. TeeSinkTokenizer would only *not* work correctly if somebody adds a custom instance using addAttributeImpl in the ctor of another filter in the chain. In that case, the factory would create another impl and restoreState throws IAE. In backwards-compatibility mode (the default) the newly created sink and also the tee always have the default TokenWrapper implementation, so state restoring also works. You only have a problem if you change useOnlyNewAPI in between (which would always create corrupt chains). Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used?

I started to create a test for the new TeeSinkTokenFilter, but there is one thing missing: the original test created a subclass of SinkTokenizer, overriding add() to filter the tokens added to the sink. This functionality is missing with the new API. The correct workaround would be to plug a filter around the sink and filter the tokens there? The problem then is that the cache always also contains tokens that are not needed (the old impl would not store them in the sink). Maybe we add the filter to the TeeSinkTokenFilter (taking a State, which would not work, as the contents of State are package-private?). Something else? Or leave it as it is and let the user plug the filter on top of the sink (I prefer this)?
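On that last point (letting the user plug a filter on top of the sink), such a filter would just be an ordinary TokenFilter written against the new API. A sketch with made-up class name and filtering rule, not taken from the patch; only TokenFilter, TokenStream and TermAttribute are existing Lucene 2.9 classes:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Instead of overriding SinkTokenizer.add(), filter the sink's output with a
    // normal TokenFilter chained on top of it.
    public final class SinkFilterSketch extends TokenFilter {
      private final TermAttribute termAtt;

      public SinkFilterSketch(TokenStream sink) {
        super(sink);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
      }

      public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
          if (!"the".equals(termAtt.term())) {  // keep everything except "the"
            return true;
          }
        }
        return false;
      }
    }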
RE: latest lucene update
OK. At least I saw a speedup during my tests :). I have the logs somewhere. Which tests were negatively affected? Then I can look into the before/after logs.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, July 16, 2009 3:53 PM
To: java-dev@lucene.apache.org
Subject: Re: latest lucene update

On Thu, Jul 16, 2009 at 2:11 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> Did you also test that the speed went back to normal with the latest fix in trunk (without modifying Solr code)?

I didn't - I was already part way through implementing advance() in Solr. I'm sure the advance() fix in Lucene would have worked too, though.

-Yonik
http://www.lucidimagination.com
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy. Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally broken as the current solution, but yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like "ClassA method();" you can then upgrade it to "SubclassOfClassA method();" without breaking drop-in compatibility, which renders the getPayloadSpans vs. getSpans alternative totally useless.
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731971#action_12731971 ]

Mark Miller commented on LUCENE-1748:
-------------------------------------

bq. From my perspective at least, backwards compatibility is already broken, since Lucene doesn't work with SpanQuerys that don't implement getPayloadSpans

Ah, I see - I hadn't looked at this issue in a long time. It looks like you must implement it to do much of anything, right? We need to address this better - perhaps abstract is the way to go.
[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot edited comment on LUCENE-1748 at 7/16/09 7:54 AM:
------------------------------------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy. Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally broken as the current solution, but yields much less strange copy-paste.

-I also have a faint feeling that if you expose a method like "ClassA method();" you can then upgrade it to "SubclassOfClassA method();" without breaking drop-in compatibility, which renders the getPayloadSpans vs. getSpans alternative totally useless.-

Ok, I'm wrong.
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731984#action_12731984 ] Mark Miller commented on LUCENE-1748: - bq. the whole getPayloadSpans deal is a herecy. heh. don't dig too deep - it also has to load all of the payloads as it matches, whether you ask for them or not (if they exist). The ordered or unordered matcher also has to load them and dump them in certain situations when they are not actually needed. Let's look at what we need to do to fix this - we don't have to worry too much about back compat, because it's already pretty screwed I think. getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract -- Key: LUCENE-1748 URL: https://issues.apache.org/jira/browse/LUCENE-1748 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.4, 2.4.1 Environment: all Reporter: Hugh Cayless Fix For: 2.4.2 I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: DISI semantics
Uwe / Yonik, DISI's class javadoc states this: Implementations of this class are expected to consider {...@link Integer#MAX_VALUE} as an invalid value. Therefore last cannot be set to MAX_VAL in the above example, if it wants to be a DISI at least. Phew ... that was a long issue. I was able to find the conversation on -1 vs. any value before the first there: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12714298 That link points to my response to Mike w/ why I think it'd be wrong to relax the policy of docId(). You can read 1-2 comments up and down to get the full conversation. In short, if we don't document clearly what is returned by docId() before the iteration started, it will be hard for a code which receives a DISI to determine whether to call nextDoc() or start by collecting what docId() returns. Can be worked around though, but I think the API is clear now and does not leave room for interpretation. Shai On Thu, Jul 16, 2009 at 5:29 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Wed, Jul 15, 2009 at 6:55 PM, Michael McCandlessluc...@mikemccandless.com wrote: I believe we debated allowing the DISI to return any docID less than its first real docID, not only -1, as you've done here, but I think Shai found something wrong with that IIRC... but I can't find this discussion. Shai do you remember / can you find this past discussion / am I just hallucinating? I don't know if it exists in Lucene, but I guess I can see the benefit of only having -1 or NO_MORE_DOCS. Consider a simplified ConjunctionScorer that didn't do anything in the constructor but simply skipped one iterator and then did the logic of doNext() until they all matched. One could get a false hit with my theoretical SliceDocIdSetIterator above. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731979#action_12731979 ] Mark Miller commented on LUCENE-1748: - Okay, so it says: Implementing classes that want access to the payloads will need to implement this. But in reality, if you don't implement it, it looks like you're screwed if you add it to the container SpanQueries, whether you access the payloads or not. getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract -- Key: LUCENE-1748 URL: https://issues.apache.org/jira/browse/LUCENE-1748 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.4, 2.4.1 Environment: all Reporter: Hugh Cayless Fix For: 2.4.2 I just spent a long time tracking down a bug resulting from upgrading to Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was written against 2.3. Since the project's SpanQuerys didn't implement getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans which returned null and caused a NullPointerException in the Lucene code, far away from the actual source of the problem. It would be much better for this kind of thing to show up at compile time, I think. Thanks! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1566: --- Attachment: LUCENE-1566.patch OK I reworked the patch some, tweaking javadocs, changes, etc., and simplifying the loops that read the bytes inside NIOFSDir SimpleFSDir. I think it's ready to commit. Simon can you take a look? Thanks. Large Lucene index can hit false OOM due to Sun JRE issue - Key: LUCENE-1566 URL: https://issues.apache.org/jira/browse/LUCENE-1566 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, LUCENE_1566_IndexInput_Changes.patch This is not a Lucene issue, but I want to open this so future google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 The gist seems to be, if you try to read a large (eg 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this, with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
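For context on the "several smaller reads" workaround mentioned in the description, an illustrative sketch follows: read a large buffer in bounded chunks instead of one huge RandomAccessFile.read call. The 64MB chunk size is an arbitrary value for illustration; the actual chunk size and structure in the LUCENE-1566 patch may differ.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

// Never ask the JRE for hundreds of MB in a single read call, which can trigger
// the spurious OutOfMemoryError described in Sun bug 6478546.
public class ChunkedReads {
  private static final int CHUNK_SIZE = 64 * 1024 * 1024; // illustrative, not the patch's value

  public static void readFully(RandomAccessFile file, byte[] b, int offset, int len)
      throws IOException {
    while (len > 0) {
      int toRead = Math.min(len, CHUNK_SIZE);
      int read = file.read(b, offset, toRead);
      if (read == -1) {
        throw new IOException("read past EOF");
      }
      offset += read;
      len -= read;
    }
  }
}
{code}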
[jira] Commented: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731993#action_12731993 ] Michael McCandless commented on LUCENE-1505: bq. For completeness, shoudl we also add them for the ones with the shift value at the end? an char[]? I was reluctant to do this. Let's hold off add these when the need first arises? bq. I wonder if it would make sense to do some cleanup in the code (final vars and args etc.) and if we should remove this logging code Agreed -- looks like you've opened a new issue for this already; thanks! I'll commit shortly. Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils - Key: LUCENE-1505 URL: https://issues.apache.org/jira/browse/LUCENE-1505 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Ryan McKinley Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1505.patch Currently spatial contrib includes a copy of NumberUtils from solr (otherwise it would depend on solr) Once LUCENE-1496 is sorted out, this copy should be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: DISI semantics
OK, that makes sense: So the example of Yonik should be interpreted like this (I think this is the optimal solution as it does not use an additional if-clause to check if the iteration has already started): class SliceDocIdSetIterator extends DocIdSetIterator { private int doc=-1,act,last; public SliceDocIdSetIterator(int first, int last) { this.act=first-1; this.last=last; } public int docID() { return doc; } public int nextDoc() throws IOException { if (++actlast) act=NO_MORE_DOCS; return doc = act; } public int advance(int target) throws IOException { act=target; if (actlast) act=NO_MORE_DOCS; return doc = act; } } - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de _ From: Shai Erera [mailto:ser...@gmail.com] Sent: Thursday, July 16, 2009 5:04 PM To: java-dev@lucene.apache.org; yo...@lucidimagination.com Subject: Re: DISI semantics Uwe / Yonik, DISI's class javadoc states this: Implementations of this class are expected to consider {...@link Integer#MAX_VALUE} as an invalid value. Therefore last cannot be set to MAX_VAL in the above example, if it wants to be a DISI at least. Phew ... that was a long issue. I was able to find the conversation on -1 vs. any value before the first there: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298 https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298 page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#act ion_12714298 page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#act ion_12714298 That link points to my response to Mike w/ why I think it'd be wrong to relax the policy of docId(). You can read 1-2 comments up and down to get the full conversation. In short, if we don't document clearly what is returned by docId() before the iteration started, it will be hard for a code which receives a DISI to determine whether to call nextDoc() or start by collecting what docId() returns. Can be worked around though, but I think the API is clear now and does not leave room for interpretation. Shai On Thu, Jul 16, 2009 at 5:29 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Jul 15, 2009 at 6:55 PM, Michael McCandlessluc...@mikemccandless.com wrote: I believe we debated allowing the DISI to return any docID less than its first real docID, not only -1, as you've done here, but I think Shai found something wrong with that IIRC... but I can't find this discussion. Shai do you remember / can you find this past discussion / am I just hallucinating? I don't know if it exists in Lucene, but I guess I can see the benefit of only having -1 or NO_MORE_DOCS. Consider a simplified ConjunctionScorer that didn't do anything in the constructor but simply skipped one iterator and then did the logic of doNext() until they all matched. One could get a false hit with my theoretical SliceDocIdSetIterator above. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
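Note that the comparison operators in the snippet above appear to have been stripped somewhere in the plain-text mail ("if (++actlast)", "if (actlast)"). A reconstruction, assuming those were '>' comparisons against last and written against the 2.9 DocIdSetIterator API:

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Reconstruction of Uwe's example; the '>' operators are an assumption inferred
// from the surrounding discussion (advancing past 'last' means exhaustion).
class SliceDocIdSetIterator extends DocIdSetIterator {
  private int doc = -1, act, last;

  public SliceDocIdSetIterator(int first, int last) {
    this.act = first - 1;
    this.last = last;
  }

  public int docID() {
    return doc;
  }

  public int nextDoc() throws IOException {
    if (++act > last) act = NO_MORE_DOCS;
    return doc = act;
  }

  public int advance(int target) throws IOException {
    act = target;
    if (act > last) act = NO_MORE_DOCS;
    return doc = act;
  }
}
{code}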
[jira] Resolved: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
[ https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1505. Resolution: Fixed Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils - Key: LUCENE-1505 URL: https://issues.apache.org/jira/browse/LUCENE-1505 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Ryan McKinley Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1505.patch Currently spatial contrib includes a copy of NumberUtils from solr (otherwise it would depend on solr) Once LUCENE-1496 is sorted out, this copy should be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: DISI semantics
Agreed - that looks like the optimal solution. -Yonik http://www.lucidimagination.com On Thu, Jul 16, 2009 at 11:40 AM, Uwe Schindler u...@thetaphi.de wrote: OK, that makes sense: So the example of Yonik should be interpreted like this (I think this is the optimal solution as it does not use an additional if-clause to check if the iteration has already started): class SliceDocIdSetIterator extends DocIdSetIterator { private int doc=-1,act,last; public SliceDocIdSetIterator(int first, int last) { this.act=first-1; this.last=last; } public int docID() { return doc; } public int nextDoc() throws IOException { if (++act > last) act=NO_MORE_DOCS; return doc = act; } public int advance(int target) throws IOException { act=target; if (act > last) act=NO_MORE_DOCS; return doc = act; } } - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: DISI semantics
Of course - if you don't plan to push this DISI into uncontrolled land, you can use the previous solution as well. I.e., if you never rely on docId to know whether to start the iteration, and don't pass this DISI to Lucene somehow etc., there's no need to use act or adhere completely to the API. Otherwise, I agree, this looks to be the best solution. Maybe ... just maybe ... I'd change the 'if (++act last) act = NO_MORE_DOCS' to 'if (++act last) return doc = NO_MORE_DOCS' to avoid the 'act' assignment .. but since it will only happen once, I don't think it's worth it. On Thu, Jul 16, 2009 at 6:43 PM, Yonik Seeley yo...@lucidimagination.comwrote: Agreed - that looks like the optimal solution. -Yonik http://www.lucidimagination.com On Thu, Jul 16, 2009 at 11:40 AM, Uwe Schindleru...@thetaphi.de wrote: OK, that makes sense: So the example of Yonik should be interpreted like this (I think this is the optimal solution as it does not use an additional if-clause to check if the iteration has already started): class SliceDocIdSetIterator extends DocIdSetIterator { private int doc=-1,act,last; public SliceDocIdSetIterator(int first, int last) { this.act=first-1; this.last=last; } public int docID() { return doc; } public int nextDoc() throws IOException { if (++actlast) act=NO_MORE_DOCS; return doc = act; } public int advance(int target) throws IOException { act=target; if (actlast) act=NO_MORE_DOCS; return doc = act; } } - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1742) Wrap SegmentInfos in public class
[ https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732042#action_12732042 ] Michael McCandless commented on LUCENE-1742: I don't think we should make IndexWriter's ReaderPool public just yet? Maybe instead we can add API to query for whether a segment has pending unflushed deletes? (And fix core merge policies to use that API when deciding how to expungeDeletes). Wrap SegmentInfos in public class -- Key: LUCENE-1742 URL: https://issues.apache.org/jira/browse/LUCENE-1742 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1742.patch, LUCENE-1742.patch Original Estimate: 48h Remaining Estimate: 48h Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not need to be in the org.apache.lucene.index package. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732050#action_12732050 ] Michael McCandless commented on LUCENE-1683: Do you have a proposed fix for this...? Or, why is RegexQuery treating the trailing . as a .* instead? RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.3.2 Reporter: Trejkaz I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting. The regex cat. will match cats but also anything with cat and 1+ following letters (e.g. cathy, catcher, ...) It is as if there is an implicit .* always added to the end of the regex. Here's a unit test for the behaviour I would expect myself: @Test public void testNecessity() throws Exception { File dir = new File(new File(System.getProperty(java.io.tmpdir)), index); IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true); try { Document doc = new Document(); doc.add(new Field(field, cat cats cathy, Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); } finally { writer.close(); } IndexReader reader = IndexReader.open(dir); try { TermEnum terms = new RegexQuery(new Term(field, cat.)).getEnum(reader); assertEquals(Wrong term, cats, terms.term()); assertFalse(Should have only been one term, terms.next()); } finally { reader.close(); } } This test fails on the term check with terms.term() equal to cathy. Our workaround is to mangle the query like this: String fixed = String.format((?:%s)$, original); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732051#action_12732051 ] Michael McCandless commented on LUCENE-1566: OK thanks Simon; I'll commit shortly. Large Lucene index can hit false OOM due to Sun JRE issue - Key: LUCENE-1566 URL: https://issues.apache.org/jira/browse/LUCENE-1566 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, LUCENE_1566_IndexInput_Changes.patch This is not a Lucene issue, but I want to open this so future google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 The gist seems to be, if you try to read a large (eg 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this, with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue
[ https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1566. Resolution: Fixed Thanks Simon! Large Lucene index can hit false OOM due to Sun JRE issue - Key: LUCENE-1566 URL: https://issues.apache.org/jira/browse/LUCENE-1566 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, LUCENE_1566_IndexInput_Changes.patch This is not a Lucene issue, but I want to open this so future google diggers can more easily find it. There's this nasty bug in Sun's JRE: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 The gist seems to be, if you try to read a large (eg 200 MB) number of bytes during a single RandomAccessFile.read call, you can incorrectly hit OOM. Lucene does this, with norms, since we read in one byte per doc per field with norms, as a contiguous array of length maxDoc(). The workaround was a custom patch to do large file reads as several smaller reads. Background here: http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732060#action_12732060 ] Steven Rowe commented on LUCENE-1683: - bq. ... why is RegexQuery treating the trailing . as a .* instead? JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing .*, unless you explicity append a $ to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing .*. The difference in the two implementations implies this is a kind of bug, especially since the javadoc contract on RegexCapabilities.match() just says @return true if string matches the pattern last passed to compile. The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.match() instead of lookingAt(). RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.3.2 Reporter: Trejkaz I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting. The regex cat. will match cats but also anything with cat and 1+ following letters (e.g. cathy, catcher, ...) It is as if there is an implicit .* always added to the end of the regex. Here's a unit test for the behaviour I would expect myself: @Test public void testNecessity() throws Exception { File dir = new File(new File(System.getProperty(java.io.tmpdir)), index); IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true); try { Document doc = new Document(); doc.add(new Field(field, cat cats cathy, Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); } finally { writer.close(); } IndexReader reader = IndexReader.open(dir); try { TermEnum terms = new RegexQuery(new Term(field, cat.)).getEnum(reader); assertEquals(Wrong term, cats, terms.term()); assertFalse(Should have only been one term, terms.next()); } finally { reader.close(); } } This test fails on the term check with terms.term() equal to cathy. Our workaround is to mangle the query like this: String fixed = String.format((?:%s)$, original); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732060#action_12732060 ] Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM: --- bq. ... why is RegexQuery treating the trailing . as a .* instead? JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing .*, unless you explicity append a $ to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing .*. The difference in the two implementations implies this is a kind of bug, especially since the javadoc contract on RegexCapabilities.match() just says @return true if string matches the pattern last passed to compile. The fix is to switch JavaUtilRegexCapabilities.match to use Matcher.match() instead of lookingAt(). was (Author: steve_rowe): bq. ... why is RegexQuery treating the trailing . as a .* instead? JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing .*, unless you explicity append a $ to the pattern. By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing .*. The difference in the two implementations implies this is a kind of bug, especially since the javadoc contract on RegexCapabilities.match() just says @return true if string matches the pattern last passed to compile. The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.match() instead of lookingAt(). RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.3.2 Reporter: Trejkaz I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting. The regex cat. will match cats but also anything with cat and 1+ following letters (e.g. cathy, catcher, ...) It is as if there is an implicit .* always added to the end of the regex. Here's a unit test for the behaviour I would expect myself: @Test public void testNecessity() throws Exception { File dir = new File(new File(System.getProperty(java.io.tmpdir)), index); IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true); try { Document doc = new Document(); doc.add(new Field(field, cat cats cathy, Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); } finally { writer.close(); } IndexReader reader = IndexReader.open(dir); try { TermEnum terms = new RegexQuery(new Term(field, cat.)).getEnum(reader); assertEquals(Wrong term, cats, terms.term()); assertFalse(Should have only been one term, terms.next()); } finally { reader.close(); } } This test fails on the term check with terms.term() equal to cathy. Our workaround is to mangle the query like this: String fixed = String.format((?:%s)$, original); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
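To make the difference concrete, here is a small demo of the anchoring behaviour Steve describes. Note that java.util.regex.Matcher offers matches() and lookingAt(); "Matcher.match()" above presumably means matches().

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// lookingAt() only anchors at the start of the input, which behaves like an
// implicit trailing ".*"; matches() requires the entire term to match.
public class RegexAnchoringDemo {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("cat.");
    for (String term : new String[] {"cat", "cats", "cathy", "catcher"}) {
      Matcher m = p.matcher(term);
      boolean prefixMatch = m.lookingAt(); // true for cats, cathy, catcher
      m.reset();
      boolean fullMatch = m.matches();     // true only for cats
      System.out.println(term + ": lookingAt=" + prefixMatch + " matches=" + fullMatch);
    }
  }
}
{code}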
[jira] Updated: (LUCENE-1728) Move SmartChineseAnalyzer resources to own contrib project
[ https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1728: Attachment: LUCENE-1728.txt Simon, I revised the patch. Here are the new instructions for the analyzers/common and analyzers/smartcn scheme. Sorry for the delay. {code} ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir contrib/analyzers/common mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn add contrib/analyzers/smartcn contrib/analyzers/common svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java svn move contrib/analyzers/build.xml contrib/analyzers/common svn move contrib/analyzers/pom.xml.template contrib/analyzers/common svn move contrib/analyzers/src contrib/analyzers/common ## 3. eclipse refresh at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8) {code} Move SmartChineseAnalyzer resources to own contrib project Key: LUCENE-1728 URL: https://issues.apache.org/jira/browse/LUCENE-1728 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1728.txt, LUCENE-1728.txt SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow up to 3MB. The dictionary is quite big compared to all the other resouces / class files contained in that jar. Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g. using lucene on a mobile phone) to include analyzer.jar without getting into trouble with disk space. 
Moving SmartChineseAnalyzer to a separate project could also include a small refactoring as Robert mentioned in [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722] several classes should be package protected, members and classes could be final, commented syserr and logging code should be removed etc. I set this issue target to 2.9 - if we can not make it until then feel free to move it to 3.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1728) Move SmartChineseAnalyzer resources to own contrib project
[ https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1728: Attachment: LUCENE-1728.txt same patch, but this time i clicked ASF license... sorry! Move SmartChineseAnalyzer resources to own contrib project Key: LUCENE-1728 URL: https://issues.apache.org/jira/browse/LUCENE-1728 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 2.9 Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow up to 3MB. The dictionary is quite big compared to all the other resouces / class files contained in that jar. Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g. using lucene on a mobile phone) to include analyzer.jar without getting into trouble with disk space. Moving SmartChineseAnalyzer to a separate project could also include a small refactoring as Robert mentioned in [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722] several classes should be package protected, members and classes could be final, commented syserr and logging code should be removed etc. I set this issue target to 2.9 - if we can not make it until then feel free to move it to 3.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1749) FieldCache introspection API
FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732110#action_12732110 ] Hoss Man commented on LUCENE-1749: -- The motivation for this issue is all of the changes coming in 2.9 in how Lucene internally uses the FieldCache API -- the biggest change being per-segment sorting, but there may be others not immediately obvious. While these changes are backwards compatible from an API and functionality perspective, they could have some pretty serious performance impacts for existing apps that also use the FieldCache directly: after upgrading, those apps may suddenly seem slower to start (because of redundant FieldCache initialization) and require 2X as much RAM as they did before. This could lead people to assume Lucene has suddenly become a major memory hog. SOLR- and SOLR-1247 are some quick examples of the types of problems that apps could encounter. Currently the only way for a user to even notice the problem is to do memory profiling, and the FieldCache data structure isn't the easiest to understand. It would be a lot nicer to have some methods for doing this inspection programmatically, so users could write automated tests for incorrect/redundant usage. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-1749: - Attachment: fieldcache-introspection.patch Here's the start of a patch to provide this functionality -- it just provides a new method/datastructure for inspecting the cache; the sanity checking utility methods should be straightforward assuming people think this is a good idea. The new method itself is fairly simple, but quite a bit of refactoring to how the caches are managed was necessary to make it possible to implement the method sanely. These changes to the FieldCache internals seem like they are generally a good idea from a maintenance standpoint even if people don't like the new method. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
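As a rough illustration of the kind of automated check this is meant to enable, a hypothetical usage sketch follows. The getCacheEntries()/CacheEntry names are assumptions modelled on this patch, not a committed API at the time of this thread, and the check shown covers only the first kind of oddity listed in the issue description (the same reader/field cached more than once).

{code}
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldCache.CacheEntry;

// Hypothetical sanity check built on the proposed introspection API: flag cases
// where the same reader/field pair shows up in the cache more than once (e.g.
// cached once via getInts and once via getStringIndex).
public class FieldCacheSanityCheck {
  public static void warnOnDuplicateEntries() {
    CacheEntry[] entries = FieldCache.DEFAULT.getCacheEntries(); // assumed accessor
    for (int i = 0; i < entries.length; i++) {
      for (int j = i + 1; j < entries.length; j++) {
        boolean sameReader = entries[i].getReaderKey() == entries[j].getReaderKey();
        boolean sameField = entries[i].getFieldName().equals(entries[j].getFieldName());
        if (sameReader && sameField) {
          System.err.println("suspect duplicate FieldCache entries for field '"
              + entries[i].getFieldName() + "'");
        }
      }
    }
  }
}
{code}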
[jira] Updated: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-1749: - Lucene Fields: [New, Patch Available] (was: [New]) Fix Version/s: 2.9 Technically this isn't a bug, so i probably shouldn't add it to the 2.9 blocker list, but i really think it would be a good idea to have something like this in the 2.9 release. At the very least: i'd like to put it on the list until/unless there is consensus that it's not needed. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732123#action_12732123 ] Mark Miller commented on LUCENE-1749: - nice - would be great if it could estimate ram usage as well. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1742) Wrap SegmentInfos in public class
[ https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1742: - Attachment: LUCENE-1742.patch * Reader pool isn't public anymore * Left methods of reader as public (could roll back?) * I'd rather that readerpool be public, however since it's new I guess we don't want people relying on it? * All tests pass * It would be great to get this into 2.9 Wrap SegmentInfos in public class -- Key: LUCENE-1742 URL: https://issues.apache.org/jira/browse/LUCENE-1742 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch Original Estimate: 48h Remaining Estimate: 48h Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not need to be in the org.apache.lucene.index package. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732157#action_12732157 ] Michael McCandless commented on LUCENE-1749: +1 -- this'd be great to get into 2.9. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732166#action_12732166 ] Uwe Schindler commented on LUCENE-1749: --- Looks good as a start, one question about a comment: What do you mean with: * :TODO: is the int sort type still needed? ... doesn't seem to be used anywhere, code just tests custom for SortComparator vs Parser. I do not understand, do you want to remove the IntCache? What is different with it in comparison with the other ones? Uwe FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732190#action_12732190 ] Hoss Man commented on LUCENE-1749: -- bq. :TODO: is the int sort type still needed? ... doesn't seem to be used anywhere, code just tests custom for SortComparator vs Parser. sorry ... badly placed quotes ... that was in reference to Entry.type. Until I changed getStrings, getStringIndex, and getAuto to construct Entry objects as part of my refactoring, the type attribute (and the constructor that takes a type argument) didn't seem to be used anywhere (as far as I could tell). My guess: maybe some previous changes refactored the logic that switched on type up into the SortFields, so the FieldCache no longer needs to care about it? FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1750) LogByteSizeMergePolicy doesn't keep segments under maxMergeMB
LogByteSizeMergePolicy doesn't keep segments under maxMergeMB - Key: LUCENE-1750 URL: https://issues.apache.org/jira/browse/LUCENE-1750 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Basically I'm trying to create largish 2-4GB shards using LogByteSizeMergePolicy, however I've found in the attached unit test segments that exceed maxMergeMB. The goal is for segments to be merged up to 2GB, then all merging to that segment stops, and then another 2GB segment is created. This helps when replicating in Solr where if a single optimized 60GB segment is created, the machine stops working due to IO and CPU starvation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
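For reference, the intended setup is roughly the following, written against the 2.4.x API. Whether maxMergeMB actually bounds the size of the merged result, rather than just the merge inputs, is exactly what this issue questions; the sketch only shows the configuration being attempted.

{code}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;

// Sketch of capping shard size at ~2GB so a single huge optimized segment is
// never produced; 2048 MB is the target from the issue description.
public class TwoGigShardConfig {
  public static void configure(IndexWriter writer) {
    LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
    policy.setMaxMergeMB(2048.0); // segments over ~2GB should no longer be merged
    writer.setMergePolicy(policy);
  }
}
{code}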
[jira] Updated: (LUCENE-1750) LogByteSizeMergePolicy doesn't keep segments under maxMergeMB
[ https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1750: - Attachment: LUCENE-1750.patch Unit test illustrating the issue. LogByteSizeMergePolicy doesn't keep segments under maxMergeMB - Key: LUCENE-1750 URL: https://issues.apache.org/jira/browse/LUCENE-1750 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1750.patch Original Estimate: 48h Remaining Estimate: 48h Basically I'm trying to create largish 2-4GB shards using LogByteSizeMergePolicy, however I've found in the attached unit test segments that exceed maxMergeMB. The goal is for segments to be merged up to 2GB, then all merging to that segment stops, and then another 2GB segment is created. This helps when replicating in Solr where if a single optimized 60GB segment is created, the machine stops working due to IO and CPU starvation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1749: Attachment: LUCENE-1749.patch Here is a start towards guessing the fieldcache ram usage. It probably works fairly well, though it will be limited by stack space on a very heavily nested object graph. I've added the size guess for getValue in the introspection output. Its a start anyway. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
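As a back-of-the-envelope illustration of what such a RAM guess has to add up: the header sizes below are assumptions for a typical 32-bit JVM, and real overheads vary per JVM, which is presumably why the patch's JavaImpl/MemoryModel class is pluggable. String-valued fields are the hard case, since every entry is a separate object rather than a primitive slot.

{code}
// Rough per-field sizing for the primitive FieldCache arrays (assumed 32-bit JVM).
public class FieldCacheRamEstimate {
  static final int ARRAY_HEADER = 12; // assumed: object header + length field

  static long intArrayBytes(int maxDoc) {
    return ARRAY_HEADER + 4L * maxDoc; // getInts/getFloats: 4 bytes per document
  }

  static long byteArrayBytes(int maxDoc) {
    return ARRAY_HEADER + 1L * maxDoc; // getBytes: 1 byte per document
  }

  public static void main(String[] args) {
    int maxDoc = 10000000; // a 10M-doc index
    System.out.println("int field:  ~" + intArrayBytes(maxDoc) / (1024 * 1024) + " MB");
    System.out.println("byte field: ~" + byteArrayBytes(maxDoc) / (1024 * 1024) + " MB");
  }
}
{code}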
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732297#action_12732297 ] Mark Miller commented on LUCENE-1749: - We prob would want to provide an alternate toString that includes the ram guess and the default that skips it - i havn't tested performance, but it might take a while to check a gigantic string array. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732297#action_12732297 ] Mark Miller edited comment on LUCENE-1749 at 7/16/09 6:35 PM: -- We prob would want to provide an alternate toString that includes the ram guess and the default that skips it - i havn't tested performance, but it might take a while to check a gigantic string array. Also, JavaImpl should probably actually be JavaMemoryModel or MemoryModel. was (Author: markrmil...@gmail.com): We prob would want to provide an alternate toString that includes the ram guess and the default that skips it - i havn't tested performance, but it might take a while to check a gigantic string array. FieldCache introspection API Key: LUCENE-1749 URL: https://issues.apache.org/jira/browse/LUCENE-1749 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Hoss Man Priority: Minor Fix For: 2.9 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch FieldCache should expose an Expert level API for runtime introspection of the FieldCache to provide info about what is in the FieldCache at any given moment. We should also provide utility methods for sanity checking that the FieldCache doesn't contain anything odd... * entries for the same reader/field with different types/parsers * entries for the same field/type/parser in a reader and it's subreader(s) * etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org