RE: latest lucene update

2009-07-16 Thread Uwe Schindler
Did you also test whether the speed went back to normal with the latest
fix in trunk (without modifying Solr code)?

I ran the Solr tests with the updated lucene-core-2.9.jar here, but I was not
able to find out which of the tests had the big slowdown. I only noticed
some speedup in a few search-related tests.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, July 16, 2009 2:57 AM
 To: java-dev@lucene.apache.org
 Subject: Re: latest lucene update
 
 Thanks guys, I had actually meant this message to go to solr-dev...
 hence the "but I think we should implement the new methods anyway."
 I've implemented them, and the performance has returned to normal.
 
 -Yonik
 http://www.lucidimagination.com
 
 
 
 On Wed, Jul 15, 2009 at 4:00 PM, Yonik Seeley <yo...@lucidimagination.com>
 wrote:
  Running solr unit tests seems a fair bit slower now.  I think the root
  cause may be this:
 
 http://search.lucidimagination.com/search/document/a8bd12c3b87e98a3/speed_of_booleanqueries_on_2_9
  That may be fixed, but I think we should implement the new methods
 anyway.
 
  I'm also surprised that more changes weren't necessary to get the
  latest Lucene to work... one thing in particular is docs out of order
  - Solr currently requires them in-order to correctly create DocSet
  instances, and I'm not sure this is the case any more.  I'll look into
  it.
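
  As a rough illustration - not Solr's actual code - such an in-order
  requirement can be made explicit with the 2.9 Collector API (assuming the
  Collector/acceptsDocsOutOfOrder methods on trunk at the time):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.util.OpenBitSet;

    // Collects matching doc ids into a bit set and tells the searcher
    // that hits must arrive in increasing doc id order, which is what
    // incremental DocSet building relies on.
    public class InOrderDocSetCollector extends Collector {
      private final OpenBitSet bits;
      private int docBase;

      public InOrderDocSetCollector(int maxDoc) {
        bits = new OpenBitSet(maxDoc);
      }

      public void setScorer(Scorer scorer) throws IOException {
        // scores are not needed to build a doc id set
      }

      public void collect(int doc) throws IOException {
        bits.set(docBase + doc);  // doc is relative to the current segment
      }

      public void setNextReader(IndexReader reader, int docBase)
          throws IOException {
        this.docBase = docBase;
      }

      public boolean acceptsDocsOutOfOrder() {
        return false;  // refuse out-of-order scorers such as BooleanScorer
      }
    }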
 
  -Yonik
  http://www.lucidimagination.com
 
 



[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-16 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1693:
--

Attachment: lucene-1693.patch

This is basically your last patch with these changes:

- I removed AttributeSource.setAttributeFactory(factory). Since we now have the 
constructor that takes the factory as an arg, there should be no need to 
ever change the factory after a TokenStream has been created. It would also lead to 
problems regarding e.g. Tee/Sink: a user could add attributes to the Tee, then 
change the factory, then create the Sink. How could we then create the same 
attribute impls for the Sink? So I think the right thing to do is to not allow 
changing the factory after the stream is instantiated (see the sketch after 
these notes).

- I added the initial (untested) version of TeeSinkTokenFilter to demonstrate 
how I think it should work now. I'll finish it tomorrow or Friday (add more 
javadocs and a unit test). I'll also add the CachingAttributeTokenFilter, which 
is essentially almost the same as the new inner class of TeeSinkTokenFilter. 
Once I have CATF, the inner class can probably just extend it.
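
A rough usage sketch of the constructor-only factory (names as in the patch 
under discussion, so treat the exact signatures as an assumption):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.util.AttributeSource;

  // The factory is supplied exactly once, in the constructor, and stays
  // fixed for the lifetime of the stream; there is no setter to swap it.
  public final class MyTokenStream extends TokenStream {
    private final TermAttribute termAtt;

    public MyTokenStream(AttributeSource.AttributeFactory factory) {
      super(factory);  // all attribute impls will come from this factory
      termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException {
      return false;  // a real stream would fill the attributes here
    }
  }

A Tee and a Sink built from the same factory thus always create matching 
attribute impls.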

 AttributeSource/TokenStream API improvements
 

 Key: LUCENE-1693
 URL: https://issues.apache.org/jira/browse/LUCENE-1693
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, 
 LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, 
 TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, 
 TestCompatibility.java


 This patch makes the following improvements to AttributeSource and
 TokenStream/Filter:
 - removes the set/getUseNewAPI() methods (including the standard
   ones). Instead by default incrementToken() throws a subclass of
   UnsupportedOperationException. The indexer tries to call
   incrementToken() initially once to see if the exception is thrown;
   if so, it falls back to the old API.
 - introduces interfaces for all Attributes. The corresponding
   implementations have the postfix 'Impl', e.g. TermAttribute and
   TermAttributeImpl. AttributeSource now has a factory for creating
   the Attribute instances; the default implementation looks for
   implementing classes with the postfix 'Impl'. Token now implements
   all 6 TokenAttribute interfaces.
 - new method added to AttributeSource:
   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
   class hierarchy of the passed in object and finds all interfaces
   that the class or superclasses implement and that extend the
   Attribute interface. It then adds the interface-instance mappings
   to the attribute map for each of the found interfaces.
 - AttributeImpl now has a default implementation of toString that uses
   reflection to print out the values of the attributes in a default
   formatting. This makes it a bit easier to implement AttributeImpl,
   because toString() was declared abstract before.
 - Cloning is now done much more efficiently in
   captureState. The method figures out which unique AttributeImpl
   instances are contained as values in the attributes map, because
   those are the ones that need to be cloned. It creates a single
   linked list that supports deep cloning (in the inner class
   AttributeSource.State). AttributeSource keeps track of when this
   state changes, i.e. whenever new attributes are added to the
   AttributeSource. Only in that case will captureState recompute the
   state, otherwise it will simply clone the precomputed state and
   return the clone. restoreState(AttributeSource.State) walks the
   linked list and uses the copyTo() method of AttributeImpl to copy
   all values over into the attribute that the source stream
   (e.g. SinkTokenizer) uses. 
 The cloning performance can be greatly improved if not multiple
 AttributeImpl instances are used in one TokenStream. A user can
 e.g. simply add a Token instance to the stream instead of the individual
 attributes. Or the user could implement a subclass of AttributeImpl that
 implements exactly the Attribute interfaces needed. I think this
 should be considered an expert API (addAttributeImpl), as this manual
 optimization is only needed if cloning performance is crucial. I ran
 some quick performance tests using Tee/Sink tokenizers (which do
 cloning) and the performance was roughly 20% faster with the new
 API. I'll run some more performance tests and post more numbers then.
 Note also that when we add serialization to the Attributes, e.g. for
 supporting storing serialized TokenStreams in the index, then the
 serialization should benefit even significantly more from the new API
 than cloning.
 Also, the TokenStream API does not change, except for the removal
 of the set/getUseNewAPI methods. So the patches in LUCENE-1460
 should still work.
 All core tests pass, however, I need to update all the documentation
 and also add some unit tests for the new AttributeSource
 functionality. So this patch is not ready to commit yet, but I wanted
 to post it already for some feedback.
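
 For illustration, a minimal sketch of the consumer side with the new API
 (2.9-style names; TermAttribute.term() and OffsetAttribute.startOffset()
 are assumed from the patch):

   import java.io.StringReader;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.WhitespaceTokenizer;
   import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
   import org.apache.lucene.analysis.tokenattributes.TermAttribute;

   public class NewApiDemo {
     public static void main(String[] args) throws Exception {
       TokenStream stream = new WhitespaceTokenizer(new StringReader("a b c"));
       // The consumer asks for interfaces; the factory picks the 'Impl' classes.
       TermAttribute term = (TermAttribute) stream.addAttribute(TermAttribute.class);
       OffsetAttribute offset = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
       while (stream.incrementToken()) {
         System.out.println(term.term() + " start=" + offset.startOffset());
       }
     }
   }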

[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue

2009-07-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1566:


Attachment: LUCENE_1566_IndexInput_Changes.patch

* Set chunkSize to Integer.MAX_VALUE on 64-bit JVMs
* Removed the 64-bit JVM condition, as chunkSize is set to the maximum in the 64-bit case
* Added a CHANGES.TXT entry to the patch

@Mike: once you commit this change I will close this issue.

Simon

 Large Lucene index can hit false OOM due to Sun JRE issue
 -

 Key: LUCENE-1566
 URL: https://issues.apache.org/jira/browse/LUCENE-1566
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, 
 LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch


 This is not a Lucene issue, but I want to open this so future google
 diggers can more easily find it.
 There's this nasty bug in Sun's JRE:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546
 The gist seems to be that if you try to read a large number of bytes
 (eg 200 MB) during a single RandomAccessFile.read call, you can incorrectly
 hit OOM.  Lucene does this with norms, since we read in one byte per
 doc per field with norms, as a contiguous array of length maxDoc().
 The workaround was a custom patch to do large file reads as several
 smaller reads.
 Background here:
   http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html
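
 For google diggers, an illustrative sketch of the workaround idea (not
 Lucene's actual FSIndexInput code): perform one logical read as several
 bounded reads, so no single RandomAccessFile.read call asks for hundreds
 of megabytes at once:

   import java.io.IOException;
   import java.io.RandomAccessFile;

   public class ChunkedReader {
     // Reads len bytes into b, in chunks of at most chunkSize bytes each.
     public static void readFully(RandomAccessFile file, byte[] b, int offset,
                                  int len, int chunkSize) throws IOException {
       int total = 0;
       while (total < len) {
         int toRead = Math.min(chunkSize, len - total);
         int got = file.read(b, offset + total, toRead);
         if (got < 0)
           throw new IOException("read past EOF");
         total += got;
       }
     }
   }

 The patch above additionally sets chunkSize to Integer.MAX_VALUE on 64-bit
 JVMs, where the chunking is not needed.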




[jira] Created: (LUCENE-1747) Contrib/Spatial needs code cleanup before release

2009-07-16 Thread Simon Willnauer (JIRA)
Contrib/Spatial needs code cleanup before release
-

 Key: LUCENE-1747
 URL: https://issues.apache.org/jira/browse/LUCENE-1747
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9


I had a brief look at the spatial sources and found that there are quite a 
few warnings, main methods, loggers, immutable classes without final 
members, unused variables, unused methods, etc.
Once Mike has committed https://issues.apache.org/jira/browse/LUCENE-1505 I will 
start cleaning this up a bit.
It seems that there are not many unit tests in this project either; I might open 
an issue for 3.0 / 3.1 later, though.




[jira] Updated: (LUCENE-1567) New flexible query parser

2009-07-16 Thread Adriano Crestani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adriano Crestani updated LUCENE-1567:
-

Attachment: lucene_trunk_FlexQueryParser_2009july16_v7.patch

Here are some updates for the new query parser:

- support for setting the minimum fuzzy similarity was added to the configuration 
handler

- get methods were added to the configuration handler, so users who are used 
to the old query parser can easily access the configuration in the old way

- renamed everything referencing lucene2 to original

- removed one author tag

- improved javadoc documentation

- added a constructor to LuceneQueryParserHelper that accepts an Analyzer as an 
argument; I think Lucene users are used to creating a query parser and also 
passing the analyzer

That's it :)

I have also noticed that when building with ant build-contrib, the 
.properties files are not copied into the jar. The new query parser uses a 
property file to read the NLS messages from, and I'm getting some message 
warnings when running the tests. Is anybody getting the same warnings?

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
 Fix For: 2.9

 Attachments: lucene_1567_adriano_crestani_07_13_2009.patch, 
 lucene_trunk_FlexQueryParser_2009July09_v4.patch, 
 lucene_trunk_FlexQueryParser_2009July10_v5.patch, 
 lucene_trunk_FlexQueryParser_2009july15_v6.patch, 
 lucene_trunk_FlexQueryParser_2009july16_v7.patch, 
 lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch, new_query_parser_src.tar, 
 QueryParser_restructure_meetup_june2009_v2.pdf


 From the "New flexible query parser" thread by Michael Busch
 in my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is Apache committer and PMC member
 on the Tuscany project and getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was to separate the syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
      AND
     /   \
    A     B
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. It is in fact a
 configurable chain of processors. Each processor can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
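
 A purely hypothetical sketch of how the three layers compose (every name
 below is an illustrative stand-in, not an actual class from the patch):

   import org.apache.lucene.search.Query;

   interface QueryNode { }
   interface SyntaxParser { QueryNode parse(String queryText); }        // layer 1
   interface QueryNodeProcessor { QueryNode process(QueryNode tree); }  // layer 2
   interface QueryBuilder { Query build(QueryNode tree); }              // layer 3

   class FlexParserSketch {
     static Query parse(String text, SyntaxParser parser,
                        QueryNodeProcessor[] processors, QueryBuilder builder) {
       QueryNode tree = parser.parse(text);      // 1. query text -> QueryNodeTree
       for (int i = 0; i < processors.length; i++) {
         tree = processors[i].process(tree);     // 2. walk/rewrite the tree
       }
       return builder.build(tree);               // 3. tree -> Lucene Query objects
     }
   }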
 Furthermore the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow attaching resource bundles. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in 

[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue

2009-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731880#action_12731880
 ] 

Michael McCandless commented on LUCENE-1566:


SimpleFSDirectory is missing from the last patch?




[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue

2009-07-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731886#action_12731886
 ] 

Simon Willnauer commented on LUCENE-1566:
-

bq. SimpleFSDirectory is missing from the last patch? 

oops! :)





[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731893#action_12731893
 ] 

Uwe Schindler commented on LUCENE-1693:
---

OK, looks good. I think you will go to bed now, so the work will not collide. 
If you start to program again, ask me and I will post a patch (which makes 
merging simpler). TortoiseSVN has a problem with merging added files, so when 
applying your patch I have to remove them first :-(

Some comments:
- TeeSinkTokenFilter looks good; I think we should also add a test for it (in 
principle the version of TestTeeTokenFilter from current trunk, not the one 
reverted to the old API in the current patch)
- I do not completely understand why this WeakReference is needed between Tee 
and Sink. If it is needed, the code may fail with an NPE when Reference.get() 
returns null. The idea is that one can create a Sink for the Tee and then throw 
the Sink away; the Tee would then simply no longer pass the attributes to the 
sink? If this is the case, the check for Reference.get()==null is really 
missing (see the sketch after this list).
- Should I implement CachingAttributesFilter as a replacement for 
CachingTokenFilter, or will you do it together with TeeSink?
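
A sketch of the check I mean (SinkTokenStream and addState are assumed names, 
not necessarily the real API; java.util.Iterator and java.lang.ref.WeakReference 
assumed imported):

  // When feeding states to the sinks, a sink that was garbage collected
  // is pruned and skipped instead of causing an NPE.
  for (Iterator it = sinkRefs.iterator(); it.hasNext();) {
    WeakReference ref = (WeakReference) it.next();
    SinkTokenStream sink = (SinkTokenStream) ref.get();
    if (sink == null) {
      it.remove();  // the user threw the sink away; stop feeding it
    } else {
      sink.addState(captureState());
    }
  }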

I will now start to add all the finals to the missing core analyzers.

bq. The only small performance improvement we should probably make is to avoid 
checking which method in TokenStream is overridden when onlyUseNewAPI==true

I could disable this for next() and next(Token). In the case of incrementToken, 
it should really check that it is enabled, because not doing so would fail hard 
or create endless loops. So the check should be there in all cases. But if 
onlyUseNewAPI is enabled, I could simply define hasNext and 
hasReusableNext=false. I will do this.
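
A sketch of the kind of check I mean (not the exact trunk code):

  // Determine once whether the subclass still overrides the old next(Token),
  // so the back-compat layer knows which API the stream really implements.
  boolean overridesNext;
  try {
    java.lang.reflect.Method m =
        getClass().getMethod("next", new Class[] { Token.class });
    overridesNext = (m.getDeclaringClass() != TokenStream.class);
  } catch (NoSuchMethodException e) {
    overridesNext = false;  // cannot happen: TokenStream declares next(Token)
  }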


[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731896#action_12731896
 ] 

Grant Ingersoll commented on LUCENE-1693:
-

A favor to ask: when this is ready to commit, can you give a few days' notice so 
that the rest of us can look at it before committing?  I've been keeping up 
with the comments, but not the patches.


Re: Search in non-linguistic text

2009-07-16 Thread JesL

Ack...  Clicked on the wrong group.  Sorry - I'll move it.
-- 
View this message in context: 
http://www.nabble.com/Search-in-non-linguistic-text-tp24515712p24515926.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Created: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans should be abstract

2009-07-16 Thread Hugh Cayless (JIRA)
getPayloadSpans on org.apache.lucene.search.spans should be abstract


 Key: LUCENE-1748
 URL: https://issues.apache.org/jira/browse/LUCENE-1748
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4.1, 2.4
 Environment: all
Reporter: Hugh Cayless
 Fix For: 2.4.2


I just spent a long time tracking down a bug resulting from upgrading to Lucene 
2.4.1 on a project that implements some SpanQuerys of its own and was written 
against 2.3.  Since the project's SpanQuerys didn't implement getPayloadSpans, 
the call to that method went to SpanQuery.getPayloadSpans, which returned null 
and caused a NullPointerException in the Lucene code, far away from the actual 
source of the problem.

It would be much better for this kind of thing to show up at compile time, I 
think.

Thanks!
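
For anyone hitting the same problem, the one-line override pattern used by the 
core SpanQuery implementations (quoted later in this thread) is:

  // Add to your custom SpanQuery subclass; without it, Lucene 2.4.x falls
  // through to the base method, which returns null.
  public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
    // note: only valid if getSpans(reader) really builds a PayloadSpans
    return (PayloadSpans) getSpans(reader);
  }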




[jira] Updated: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Hugh Cayless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hugh Cayless updated LUCENE-1748:
-

Summary: getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should 
be abstract  (was: getPayloadSpans on org.apache.lucene.search.spans should be 
abstract)




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731939#action_12731939
 ] 

Earwin Burrfoot commented on LUCENE-1748:
-

bq. Shouldn't it throw a runtime exception (unsupported operation?) or something?
What is the difference between adding an abstract method and adding a method 
that throws an exception, with regard to jar drop-in back compat?
In both cases, when you drop your new jar in you get an exception, except in the 
latter case the exception is deferred.




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Hugh Cayless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731940#action_12731940
 ] 

Hugh Cayless commented on LUCENE-1748:
--

Ah.  I figured it would be something like that.  Yes, if abstract isn't 
possible, an UnsupportedOperationException would at least get closer to the 
source of the problem.




[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Hugh Cayless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731940#action_12731940
 ] 

Hugh Cayless edited comment on LUCENE-1748 at 7/16/09 6:43 AM:
---

Ah.  I figured it would be something like that.  Yes, if abstract isn't 
possible, an UnsupportedOperationException would at least get closer to the 
source of the problem.

From my perspective at least, backwards compatibility is already broken, since 
Lucene doesn't work with SpanQuerys that don't implement getPayloadSpans--but 
I understand y'all will have different requirements in this regard.

  was (Author: hcayless):
Ah.  I figured it would be something like that.  Yes, if abstract isn't 
possible, an UnsupportedOperationException would at least get closer to the 
source of the problem.
  



Re: [jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Mark Miller
bq. Shouldn't it throw a runtime exception (unsupported operation?) or
something?
What is the difference between adding an abstract method and adding a
method that throws an exception, with regard to jar drop-in back compat?
In both cases, when you drop your new jar in you get an exception, except
in the latter case the exception is deferred.

Yeah, it's dicey - I suppose the idea is that, if you used the code as you
used to, it wouldn't try to call getPayloadSpans? And so if you kept using
non-payload-spans functionality, you would be set, and if you tried to use
payload spans you would get an exception saying the class needed to be
updated? But if you make it abstract, we lose jar drop-in (I know I've read we
don't have it for this release anyway) and everyone has to implement the
method. At least with the exception, if you are using the class as you used
to, you can continue to do so with no work? Not that I've considered it for
very long at the moment.

I know, I see your point - this back compat stuff is always dicey - that's
why I throw it out there with a question mark - hopefully others will
continue to chime in.


-- 
- Mark

http://www.lucidimagination.com


Re: latest lucene update

2009-07-16 Thread Yonik Seeley
On Thu, Jul 16, 2009 at 2:11 AM, Uwe Schindler <u...@thetaphi.de> wrote:
 Did you also test whether the speed went back to normal with the latest
 fix in trunk (without modifying Solr code)?

I didn't - I was already part way through implementing advance() in Solr.
I'm sure the advance() fix in Lucene would have worked too though.
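
Roughly what that looks like (a sketch over a sorted int[] of doc ids, not 
Solr's actual code; assumes fields int[] docs, int pos = 0, int doc = -1, and 
NO_MORE_DOCS from DocIdSetIterator):

  // The new 2.9 DocIdSetIterator methods; advance() replaces the deprecated
  // skipTo() and positions on the first doc whose id is >= target.
  public int nextDoc() throws IOException {
    return doc = (pos < docs.length) ? docs[pos++] : NO_MORE_DOCS;
  }

  public int advance(int target) throws IOException {
    while (doc < target) {
      nextDoc();  // linear scan; real code would skip in larger steps
    }
    return doc;
  }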

-Yonik
http://www.lucidimagination.com




[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731947#action_12731947
 ] 

Uwe Schindler commented on LUCENE-1693:
---

I forgot: I also implemented the final next() methods in all non-final classes.


[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

2009-07-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1693:
--

Attachment: LUCENE-1693.patch

New patch with some more work. First the fantastic news:

As CachingTokenFilter has no API to access the cached attributes/tokens 
directly, it does not need to be deprecated; it just switches the internal and 
hidden impl to incrementToken() and attributes. I also added an additional test 
to the BW test case that checks whether the caching also works for your strange 
POSTokens. And it works! You can even mix the consumers, e.g. first use the new 
API to cache tokens and then replay using the old API. Really cool. The reason 
the POSToken was not preserved in the past was an error in 
TokenWrapper.copyTo(): this method created a new Token and copied the contents 
into it using reinit(). Now it simply creates a clone and lets delegate point to 
it (this is how the caching worked before).

In principle Tee/SinkTokenizer could also work like this; the only problem with 
this class is that it has a public API that exposes the Token instances to the 
outside. Because of that, there is no way around deprecating it.

Your new TeeSinkTokenFilter looks good; it only had one problem:
it used addAttributeImpl to add the attributes of the Tee to the newly created 
Sink. Because of this, the sink got the same instances the parent added. With 
useOnlyNewAPI, this does not have an effect for the standard attributes, as the 
ctor already created a Token instance as implementation and added it to the 
stream, so addAttributeImpl had no effect.
I changed this to use getAttributeClassesIterator and added a new attribute 
instance for each attribute class to the sink using addAttribute. As the factory 
is the same, the attributes are generated in the same way. TeeSinkTokenizer 
would only *not* work correctly if somebody adds a custom instance using 
addAttributeImpl in the ctor of another filter in the chain. In this case, the 
factory would create another impl and restoreState would throw an IAE. In 
backwards compatibility mode (the default) the newly created sink and also the 
tee always have the default TokenWrapper implementation, so state restoring also 
works. You only have a problem if you change useOnlyNewAPI in between (which 
would always create corrupt chains).

Another idea would be to clone all attribute impls and then add them to the 
sink - the factory would then not be used?

I started to create a test for the new TeeSinkTokenFilter, but there is one 
thing missing: the original test created a subclass of SinkTokenizer, 
overriding add() to filter the tokens added to the sink. This functionality is 
missing with the new API. The correct workaround would be to plug a filter 
around the sink and filter the tokens there? The problem then is that the 
cache always also contains tokens that are not needed (the old impl would not 
store them in the sink).

Maybe we add the filter to the TeeSinkTokenFilter (getting a State, which would 
not work, as the contents of State are pkg-private?). Somehow else? Or leave it 
as it is and let the user plug the filter on top of the sink (I prefer this)?
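
E.g. something like this (the type-based filtering criterion is just an 
example):

  // A normal TokenFilter on top of the sink that drops unwanted tokens
  // during replay, instead of keeping them out of the cache.
  final class SinkFilter extends TokenFilter {
    private final TypeAttribute typeAtt;

    SinkFilter(TokenStream sink) {
      super(sink);
      typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
    }

    public boolean incrementToken() throws IOException {
      while (input.incrementToken()) {
        if (!"unwanted".equals(typeAtt.type()))
          return true;  // keep this token
      }
      return false;  // sink exhausted
    }
  }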


RE: latest lucene update

2009-07-16 Thread Uwe Schindler
OK. At least I have seen a speedup during my tests :). I have the logs
somewhere. Which tests were affected negatively, so that I can look into the
before/after logs?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, July 16, 2009 3:53 PM
 To: java-dev@lucene.apache.org
 Subject: Re: latest lucene update
 
 On Thu, Jul 16, 2009 at 2:11 AM, Uwe Schindler <u...@thetaphi.de> wrote:
  Did you also test whether the speed went back to normal with the latest
  fix in trunk (without modifying Solr code)?
 
 I didn't - I was already part way through implementing advance() in Solr.
 I'm sure the advance() fix in Lucene would have worked too though.
 
 -Yonik
 http://www.lucidimagination.com
 






[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731968#action_12731968
 ] 

Mark Miller commented on LUCENE-1748:
-

My response sent to mailing list:

bq. Shouldn't it throw a runtime exception (unsupported operation?) or 
something?
What is the difference between adding an abstract method and adding a method 
that throws an exception, with regard to jar drop-in back compat?
In both cases, when you drop your new jar in you get an exception, except in 
the latter case the exception is deferred.

Yeah, it's dicey - I suppose the idea is that, if you used the code as you used 
to, it wouldn't try to call getPayloadSpans? And so if you kept using 
non-payload-spans functionality, you would be set, and if you tried to use 
payload spans you would get an exception saying the class needed to be updated? 
But if you make it abstract, we lose jar drop-in (I know I've read we don't have 
it for this release anyway) and everyone has to implement the method. At 
least with the exception, if you are using the class as you used to, you can 
continue to do so with no work? Not that I've considered it for very long at 
the moment.

I know, I see your point - this back compat stuff is always dicey - that's why I 
throw it out there with a question mark - hopefully others will continue to 
chime in.




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731972#action_12731972
 ] 

Earwin Burrfoot commented on LUCENE-1748:
-

I took a glance at the code; the whole getPayloadSpans deal is a heresy.

Each and every implementation looks like:
  public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
    return (PayloadSpans) getSpans(reader);
  }

Moving it to the base SpanQuery is just as broken as the current solution, but 
yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like
ClassA method();
you can then upgrade it to
SubclassOfClassA method();
without breaking drop-in compatibility, which renders the getPayloadSpans vs 
getSpans alternative totally useless




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12731971#action_12731971
 ] 

Mark Miller commented on LUCENE-1748:
-

bq. From my perspective at least, backwards compatibility is already broken, 
since Lucene doesn't work with SpanQuerys that don't implement getPayloadSpans

Ah, I see - I hadn't looked at this issue in a long time. It looks like you 
must implement it to do much of anything, right?

We need to address this better - perhaps abstract is the way to go.




[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972
 ] 

Earwin Burrfoot edited comment on LUCENE-1748 at 7/16/09 7:54 AM:
--

I took a glance at the code; the whole getPayloadSpans deal is heresy.

Each and every implementation looks like:
  public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
    return (PayloadSpans) getSpans(reader);
  }

Moving it to the base SpanQuery is equally broken as the current solution, but 
yields much less strange copy-paste.

-I also have a faint feeling that if you expose a method like-
-ClassA method();-
-you can then upgrade it to-
-SubclassOfClassA method();-
-without breaking drop-in compatibility, which renders getPayloadSpans vs 
getSpans alternative totally useless-
Ok, I'm wrong.

  was (Author: earwin):
I took a glance at the code; the whole getPayloadSpans deal is heresy.

Each and every implementation looks like:
  public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
    return (PayloadSpans) getSpans(reader);
  }

Moving it to the base SpanQuery is equally broken as the current solution, but 
yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like
ClassA method();
you can then upgrade it to
SubclassOfClassA method();
without breaking drop-in compatibility, which renders getPayloadSpans vs 
getSpans alternative totally useless
  
 getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
 --

 Key: LUCENE-1748
 URL: https://issues.apache.org/jira/browse/LUCENE-1748
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4, 2.4.1
 Environment: all
Reporter: Hugh Cayless
 Fix For: 2.4.2


 I just spent a long time tracking down a bug resulting from upgrading to 
 Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
 written against 2.3.  Since the project's SpanQuerys didn't implement 
 getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
 which returned null and caused a NullPointerException in the Lucene code, far 
 away from the actual source of the problem.  
 It would be much better for this kind of thing to show up at compile time, I 
 think.
 Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731984#action_12731984
 ] 

Mark Miller commented on LUCENE-1748:
-

bq. the whole getPayloadSpans deal is heresy.

Heh. Don't dig too deep - it also has to load all of the payloads as it matches, 
whether you ask for them or not (if they exist).

The ordered or unordered matcher also has to load them and dump them in certain 
situations when they are not actually needed.

Let's look at what we need to do to fix this - we don't have to worry too much 
about back-compat, 'cause it's already pretty screwed, I think.

 getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
 --

 Key: LUCENE-1748
 URL: https://issues.apache.org/jira/browse/LUCENE-1748
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4, 2.4.1
 Environment: all
Reporter: Hugh Cayless
 Fix For: 2.4.2


 I just spent a long time tracking down a bug resulting from upgrading to 
 Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
 written against 2.3.  Since the project's SpanQuerys didn't implement 
 getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
 which returned null and caused a NullPointerException in the Lucene code, far 
 away from the actual source of the problem.  
 It would be much better for this kind of thing to show up at compile time, I 
 think.
 Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: DISI semantics

2009-07-16 Thread Shai Erera
Uwe / Yonik, DISI's class javadoc states this:

"Implementations of this class are expected to consider {@link
Integer#MAX_VALUE} as an invalid value."

Therefore 'last' cannot be set to MAX_VALUE in the above example, if it wants
to be a DISI at least.

Phew ... that was a long issue. I was able to find the conversation on -1
vs. 'any value before the first' there:
https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12714298

That link points to my response to Mike w/ why I think it'd be wrong to
relax the policy of docId(). You can read 1-2 comments up and down to get
the full conversation.

In short, if we don't document clearly what is returned by docId() before
the iteration has started, it will be hard for code which receives a DISI to
determine whether to call nextDoc() or start by collecting what docId()
returns. It can be worked around, but I think the API is clear now and
does not leave room for interpretation.
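
For illustration, a rough sketch of the consuming side, assuming the -1
convention (docIdSet stands for any DocIdSet handed to the caller):

  DocIdSetIterator disi = docIdSet.iterator();
  // docID() is -1 here, so the caller never has to guess whether the
  // iterator is already positioned on a real doc - it just calls nextDoc().
  int doc;
  while ((doc = disi.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    System.out.println(doc); // consume the doc
  }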

Shai

On Thu, Jul 16, 2009 at 5:29 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

 On Wed, Jul 15, 2009 at 6:55 PM, Michael
 McCandless <luc...@mikemccandless.com> wrote:
  I believe we debated allowing the DISI to return any docID less than
  its first real docID, not only -1, as you've done here, but I think
  Shai found something wrong with that IIRC... but I can't find this
  discussion.  Shai do you remember / can you find this past discussion
  / am I just hallucinating?

 I don't know if it exists in Lucene, but I guess I can see the benefit
 of only having -1 or NO_MORE_DOCS.
 Consider a simplified ConjunctionScorer that didn't do anything in the
 constructor but simply skipped one iterator and then did the logic of
 doNext() until they all matched.  One could get a false hit with my
 theoretical SliceDocIdSetIterator above.

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731979#action_12731979
 ] 

Mark Miller commented on LUCENE-1748:
-

Okay, so it says: "Implementing classes that want access to the payloads will 
need to implement this."

But in reality, if you don't implement it, it looks like you're screwed if you 
add it to the container SpanQueries, whether you access the payloads or not.

 getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
 --

 Key: LUCENE-1748
 URL: https://issues.apache.org/jira/browse/LUCENE-1748
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4, 2.4.1
 Environment: all
Reporter: Hugh Cayless
 Fix For: 2.4.2


 I just spent a long time tracking down a bug resulting from upgrading to 
 Lucene 2.4.1 on a project that implements some SpanQuerys of its own and was 
 written against 2.3.  Since the project's SpanQuerys didn't implement 
 getPayloadSpans, the call to that method went to SpanQuery.getPayloadSpans 
 which returned null and caused a NullPointerException in the Lucene code, far 
 away from the actual source of the problem.  
 It would be much better for this kind of thing to show up at compile time, I 
 think.
 Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue

2009-07-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1566:
---

Attachment: LUCENE-1566.patch

OK I reworked the patch some, tweaking javadocs, changes, etc., and
simplifying the loops that read the bytes inside NIOFSDir 
SimpleFSDir.  I think it's ready to commit.  Simon can you take a
look?  Thanks.


 Large Lucene index can hit false OOM due to Sun JRE issue
 -

 Key: LUCENE-1566
 URL: https://issues.apache.org/jira/browse/LUCENE-1566
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, 
 LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, 
 LUCENE_1566_IndexInput_Changes.patch


 This is not a Lucene issue, but I want to open this so future google
 diggers can more easily find it.
 There's this nasty bug in Sun's JRE:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546
 The gist seems to be, if you try to read a large (eg 200 MB) number of
 bytes during a single RandomAccessFile.read call, you can incorrectly
 hit OOM.  Lucene does this, with norms, since we read in one byte per
 doc per field with norms, as a contiguous array of length maxDoc().
 The workaround was a custom patch to do large file reads as several
 smaller reads.
 Background here:
   http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils

2009-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731993#action_12731993
 ] 

Michael McCandless commented on LUCENE-1505:


bq. For completeness, should we also add them for the ones with the shift value 
at the end? And char[]? I was reluctant to do this.

Let's hold off & add these when the need first arises?

bq. I wonder if it would make sense to do some cleanup in the code (final vars 
and args etc.) and if we should remove this logging code

Agreed -- looks like you've opened a new issue for this already; thanks!

I'll commit shortly.

 Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
 -

 Key: LUCENE-1505
 URL: https://issues.apache.org/jira/browse/LUCENE-1505
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Reporter: Ryan McKinley
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1505.patch


 Currently spatial contrib includes a copy of NumberUtils from solr (otherwise 
 it would depend on solr)
 Once LUCENE-1496 is sorted out, this copy should be removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: DISI semantics

2009-07-16 Thread Uwe Schindler
OK, that makes sense: so Yonik's example should be interpreted like this (I 
think this is the optimal solution, as it does not need an additional if-clause 
to check whether the iteration has already started):

 

class SliceDocIdSetIterator extends DocIdSetIterator {

  private int doc = -1, act, last;

  public SliceDocIdSetIterator(int first, int last) {
    this.act = first - 1; this.last = last;
  }

  public int docID() {
    return doc;
  }

  public int nextDoc() throws IOException {
    if (++act > last) act = NO_MORE_DOCS;
    return doc = act;
  }

  public int advance(int target) throws IOException {
    act = target;
    if (act > last) act = NO_MORE_DOCS;
    return doc = act;
  }
}
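
A quick sanity check of the class above (assumed harness, slice [3,5]):

  DocIdSetIterator it = new SliceDocIdSetIterator(3, 5);
  // docID() is -1 before the first nextDoc(), per the DISI javadoc
  while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
    System.out.println(it.docID()); // prints 3, 4, 5, then the loop ends
  }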

 

 

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  _  

From: Shai Erera [mailto:ser...@gmail.com] 
Sent: Thursday, July 16, 2009 5:04 PM
To: java-dev@lucene.apache.org; yo...@lucidimagination.com
Subject: Re: DISI semantics

 

Uwe / Yonik, DISI's class javadoc states this:

"Implementations of this class are expected to consider {@link
Integer#MAX_VALUE} as an invalid value."

Therefore 'last' cannot be set to MAX_VALUE in the above example, if it wants
to be a DISI at least.

Phew ... that was a long issue. I was able to find the conversation on -1
vs. 'any value before the first' there:
https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12714298&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12714298

That link points to my response to Mike w/ why I think it'd be wrong to
relax the policy of docId(). You can read 1-2 comments up and down to get
the full conversation.

In short, if we don't document clearly what is returned by docId() before
the iteration has started, it will be hard for code which receives a DISI to
determine whether to call nextDoc() or start by collecting what docId()
returns. It can be worked around, but I think the API is clear now and
does not leave room for interpretation.

Shai

On Thu, Jul 16, 2009 at 5:29 PM, Yonik Seeley yo...@lucidimagination.com
wrote:

On Wed, Jul 15, 2009 at 6:55 PM, Michael
McCandless <luc...@mikemccandless.com> wrote:
 I believe we debated allowing the DISI to return any docID less than
 its first real docID, not only -1, as you've done here, but I think
 Shai found something wrong with that IIRC... but I can't find this
 discussion.  Shai do you remember / can you find this past discussion
 / am I just hallucinating?

I don't know if it exists in Lucene, but I guess I can see the benefit
of only having -1 or NO_MORE_DOCS.
Consider a simplified ConjunctionScorer that didn't do anything in the
constructor but simply skipped one iterator and then did the logic of
doNext() until they all matched.  One could get a false hit with my
theoretical SliceDocIdSetIterator above.


-Yonik
http://www.lucidimagination.com

-

To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

 



[jira] Resolved: (LUCENE-1505) Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils

2009-07-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1505.


Resolution: Fixed

 Change contrib/spatial to use trie's NumericUtils, and remove NumberUtils
 -

 Key: LUCENE-1505
 URL: https://issues.apache.org/jira/browse/LUCENE-1505
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Reporter: Ryan McKinley
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1505.patch


 Currently spatial contrib includes a copy of NumberUtils from solr (otherwise 
 it would depend on solr)
 Once LUCENE-1496 is sorted out, this copy should be removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: DISI semantics

2009-07-16 Thread Yonik Seeley
Agreed - that looks like the optimal solution.

-Yonik
http://www.lucidimagination.com

On Thu, Jul 16, 2009 at 11:40 AM, Uwe Schindler <u...@thetaphi.de> wrote:
 OK, that makes sense: So the example of Yonik should be interpreted like
 this (I think this is the optimal solution as it does not use an additional
 if-clause to check if the iteration has already started):



 class SliceDocIdSetIterator extends DocIdSetIterator {

  private int doc=-1,act,last;



  public SliceDocIdSetIterator(int first, int last) {

    this.act=first-1; this.last=last;

  }



  public int docID() {

    return doc;

  }



  public int nextDoc() throws IOException {

    if (++act > last) act=NO_MORE_DOCS;

    return doc = act;

  }



  public int advance(int target) throws IOException {

    act=target;

    if (act > last) act=NO_MORE_DOCS;

    return doc = act;

  }

 }

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: DISI semantics

2009-07-16 Thread Shai Erera
Of course - if you don't plan to push this DISI into uncontrolled land, you
can use the previous solution as well. I.e., if you never rely on docId()
to know whether to start the iteration, and don't pass this DISI to Lucene
somehow etc., there's no need to use 'act' or to adhere completely to the API.

Otherwise, I agree, this looks to be the best solution.

Maybe ... just maybe ... I'd change the 'if (++act > last) act =
NO_MORE_DOCS' to 'if (++act > last) return doc = NO_MORE_DOCS' to avoid the
'act' assignment .. but since it will only happen once, I don't think it's
worth it.
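
In code, that variant would change only nextDoc() in the class above:

  public int nextDoc() throws IOException {
    // return directly, skipping the act = NO_MORE_DOCS assignment
    if (++act > last) return doc = NO_MORE_DOCS;
    return doc = act;
  }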

On Thu, Jul 16, 2009 at 6:43 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

 Agreed - that looks like the optimal solution.

 -Yonik
 http://www.lucidimagination.com

 On Thu, Jul 16, 2009 at 11:40 AM, Uwe Schindler <u...@thetaphi.de> wrote:
  OK, that makes sense: So the example of Yonik should be interpreted like
  this (I think this is the optimal solution as it does not use an
 additional
  if-clause to check if the iteration has already started):
 
 
 
  class SliceDocIdSetIterator extends DocIdSetIterator {
 
   private int doc=-1,act,last;
 
 
 
   public SliceDocIdSetIterator(int first, int last) {
 
 this.act=first-1; this.last=last;
 
   }
 
 
 
   public int docID() {
 
 return doc;
 
   }
 
 
 
   public int nextDoc() throws IOException {
 
  if (++act > last) act=NO_MORE_DOCS;
 
 return doc = act;
 
   }
 
 
 
   public int advance(int target) throws IOException {
 
 act=target;
 
  if (act > last) act=NO_MORE_DOCS;
 
 return doc = act;
 
   }
 
  }

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1742) Wrap SegmentInfos in public class

2009-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732042#action_12732042
 ] 

Michael McCandless commented on LUCENE-1742:


I don't think we should make IndexWriter's ReaderPool public just yet?  Maybe 
instead we can add API to query for whether a segment has pending unflushed 
deletes?  (And fix core merge policies to use that API when deciding how to 
expungeDeletes).

 Wrap SegmentInfos in public class 
 --

 Key: LUCENE-1742
 URL: https://issues.apache.org/jira/browse/LUCENE-1742
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1742.patch, LUCENE-1742.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not 
 need to be in the org.apache.lucene.index package.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732050#action_12732050
 ] 

Michael McCandless commented on LUCENE-1683:


Do you have a proposed fix for this...?  Or, why is RegexQuery treating the 
trailing "." as a ".*" instead?

 RegexQuery matches terms the input regex doesn't actually match
 ---

 Key: LUCENE-1683
 URL: https://issues.apache.org/jira/browse/LUCENE-1683
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.3.2
Reporter: Trejkaz

 I was writing some unit tests for our own wrapper around the Lucene regex 
 classes, and got tripped up by something interesting.
 The regex "cat." will match "cats" but also anything with "cat" and 1+ 
 following letters (e.g. "cathy", "catcher", ...)  It is as if there is an 
 implicit ".*" always added to the end of the regex.
 Here's a unit test for the behaviour I would expect myself:
 @Test
 public void testNecessity() throws Exception {
     File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
     IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
     try {
         Document doc = new Document();
         doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
         writer.addDocument(doc);
     } finally {
         writer.close();
     }
     IndexReader reader = IndexReader.open(dir);
     try {
         TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
         assertEquals("Wrong term", "cats", terms.term().text());
         assertFalse("Should have only been one term", terms.next());
     } finally {
         reader.close();
     }
 }
 This test fails on the term check with terms.term() equal to "cathy".
 Our workaround is to mangle the query like this:
 String fixed = String.format("(?:%s)$", original);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue

2009-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732051#action_12732051
 ] 

Michael McCandless commented on LUCENE-1566:


OK thanks Simon; I'll commit shortly.

 Large Lucene index can hit false OOM due to Sun JRE issue
 -

 Key: LUCENE-1566
 URL: https://issues.apache.org/jira/browse/LUCENE-1566
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, 
 LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, 
 LUCENE_1566_IndexInput_Changes.patch


 This is not a Lucene issue, but I want to open this so future google
 diggers can more easily find it.
 There's this nasty bug in Sun's JRE:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546
 The gist seems to be, if you try to read a large (eg 200 MB) number of
 bytes during a single RandomAccessFile.read call, you can incorrectly
 hit OOM.  Lucene does this, with norms, since we read in one byte per
 doc per field with norms, as a contiguous array of length maxDoc().
 The workaround was a custom patch to do large file reads as several
 smaller reads.
 Background here:
   http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1566) Large Lucene index can hit false OOM due to Sun JRE issue

2009-07-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1566.


Resolution: Fixed

Thanks Simon!

 Large Lucene index can hit false OOM due to Sun JRE issue
 -

 Key: LUCENE-1566
 URL: https://issues.apache.org/jira/browse/LUCENE-1566
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1566.patch, LUCENE-1566.patch, LUCENE-1566.patch, 
 LUCENE_1566_IndexInput.patch, LUCENE_1566_IndexInput_Changes.patch, 
 LUCENE_1566_IndexInput_Changes.patch


 This is not a Lucene issue, but I want to open this so future google
 diggers can more easily find it.
 There's this nasty bug in Sun's JRE:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546
 The gist seems to be, if you try to read a large (eg 200 MB) number of
 bytes during a single RandomAccessFile.read call, you can incorrectly
 hit OOM.  Lucene does this, with norms, since we read in one byte per
 doc per field with norms, as a contiguous array of length maxDoc().
 The workaround was a custom patch to do large file reads as several
 smaller reads.
 Background here:
   http://www.nabble.com/problems-with-large-Lucene-index-td22347854.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-07-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060
 ] 

Steven Rowe commented on LUCENE-1683:
-

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), 
which is equivalent to adding a trailing ".*", unless you explicitly append a 
"$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference between the two implementations implies this is a kind of bug, 
especially since the javadoc contract on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use 
j.u.Matcher.matches() instead of lookingAt().
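
A tiny standalone illustration of that difference (plain java.util.regex,
outside any Lucene code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookingAtVsMatches {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("cat.").matcher("cathy");
        System.out.println(m.lookingAt()); // true: matches a prefix, i.e. behaves like "cat..*"
        System.out.println(m.matches());   // false: the whole input would have to match
    }
}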

 RegexQuery matches terms the input regex doesn't actually match
 ---

 Key: LUCENE-1683
 URL: https://issues.apache.org/jira/browse/LUCENE-1683
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.3.2
Reporter: Trejkaz

 I was writing some unit tests for our own wrapper around the Lucene regex 
 classes, and got tripped up by something interesting.
 The regex cat. will match cats but also anything with cat and 1+ 
 following letters (e.g. cathy, catcher, ...)  It is as if there is an 
 implicit .* always added to the end of the regex.
 Here's a unit test for the behaviour I would expect myself:
 @Test
 public void testNecessity() throws Exception {
 File dir = new File(new File(System.getProperty(java.io.tmpdir)), 
 index);
 IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), 
 true);
 try {
 Document doc = new Document();
 doc.add(new Field(field, cat cats cathy, Field.Store.YES, 
 Field.Index.TOKENIZED));
 writer.addDocument(doc);
 } finally {
 writer.close();
 }
 IndexReader reader = IndexReader.open(dir);
 try {
 TermEnum terms = new RegexQuery(new Term(field, 
 cat.)).getEnum(reader);
 assertEquals(Wrong term, cats, terms.term());
 assertFalse(Should have only been one term, terms.next());
 } finally {
 reader.close();
 }
 }
 This test fails on the term check with terms.term() equal to cathy.
 Our workaround is to mangle the query like this:
 String fixed = String.format((?:%s)$, original);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-07-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060
 ] 

Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM:
---

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as 
j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", 
unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference between the two implementations implies this is a kind of bug, 
especially since the javadoc contract on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use Matcher.matches() 
instead of lookingAt().

  was (Author: steve_rowe):
bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), 
which is equivalent to adding a trailing ".*", unless you explicitly append a 
"$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), 
which does not imply the trailing ".*".

The difference between the two implementations implies this is a kind of bug, 
especially since the javadoc contract on RegexCapabilities.match() just says 
"@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use 
j.u.Matcher.matches() instead of lookingAt().
  
 RegexQuery matches terms the input regex doesn't actually match
 ---

 Key: LUCENE-1683
 URL: https://issues.apache.org/jira/browse/LUCENE-1683
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.3.2
Reporter: Trejkaz

 I was writing some unit tests for our own wrapper around the Lucene regex 
 classes, and got tripped up by something interesting.
 The regex "cat." will match "cats" but also anything with "cat" and 1+ 
 following letters (e.g. "cathy", "catcher", ...)  It is as if there is an 
 implicit ".*" always added to the end of the regex.
 Here's a unit test for the behaviour I would expect myself:
 @Test
 public void testNecessity() throws Exception {
     File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
     IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
     try {
         Document doc = new Document();
         doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
         writer.addDocument(doc);
     } finally {
         writer.close();
     }
     IndexReader reader = IndexReader.open(dir);
     try {
         TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
         assertEquals("Wrong term", "cats", terms.term().text());
         assertFalse("Should have only been one term", terms.next());
     } finally {
         reader.close();
     }
 }
 This test fails on the term check with terms.term() equal to "cathy".
 Our workaround is to mangle the query like this:
 String fixed = String.format("(?:%s)$", original);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1728) Move SmartChineseAnalyzer resources to own contrib project

2009-07-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1728:


Attachment: LUCENE-1728.txt

Simon, I revised the patch. Here are the new instructions for the 
analyzers/common and analyzers/smartcn scheme.
Sorry for the delay.

{code}
## 1. clean svn checkout
## 2. run the following commands to refactor the files.

mkdir contrib/analyzers/common
mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
svn add contrib/analyzers/smartcn contrib/analyzers/common
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart
svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn
svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java
svn move contrib/analyzers/build.xml contrib/analyzers/common
svn move contrib/analyzers/pom.xml.template contrib/analyzers/common
svn move contrib/analyzers/src contrib/analyzers/common

## 3. eclipse refresh at project level.
## 4. set text-file encoding at project level to UTF-8
## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html
##    this is an existing encoding issue that is corrected by this patch.
## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)
{code}


 Move SmartChineseAnalyzer & resources to own contrib project
 

 Key: LUCENE-1728
 URL: https://issues.apache.org/jira/browse/LUCENE-1728
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1728.txt, LUCENE-1728.txt


 SmartChineseAnalyzer depends on a large dictionary that causes the analyzer 
 jar to grow up to 3MB. The dictionary is quite big compared to all the other 
 resources / class files contained in that jar. 
 Having a separate analyzer-cn contrib project enables footprint-sensitive 
 users (e.g. using lucene on a mobile phone) to include analyzer.jar without 
 getting into trouble with disk space.
 Moving SmartChineseAnalyzer to a separate project could also include a small 
 refactoring as Robert mentioned in 
 [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722] several 
 classes should be package protected, members and classes could be final, 
 commented syserr and logging code should be removed etc.
 I set this issue target to 2.9 - if we can not make it until then feel free 
 to move it to 3.0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1728) Move SmartChineseAnalyzer resources to own contrib project

2009-07-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1728:


Attachment: LUCENE-1728.txt

Same patch, but this time I clicked ASF license... sorry!

 Move SmartChineseAnalyzer & resources to own contrib project
 

 Key: LUCENE-1728
 URL: https://issues.apache.org/jira/browse/LUCENE-1728
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt


 SmartChineseAnalyzer depends on a large dictionary that causes the analyzer 
 jar to grow up to 3MB. The dictionary is quite big compared to all the other 
 resources / class files contained in that jar. 
 Having a separate analyzer-cn contrib project enables footprint-sensitive 
 users (e.g. using lucene on a mobile phone) to include analyzer.jar without 
 getting into trouble with disk space.
 Moving SmartChineseAnalyzer to a separate project could also include a small 
 refactoring as Robert mentioned in 
 [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722] several 
 classes should be package protected, members and classes could be final, 
 commented syserr and logging code should be removed etc.
 I set this issue target to 2.9 - if we can not make it until then feel free 
 to move it to 3.0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Hoss Man (JIRA)
FieldCache introspection API


 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor


FieldCache should expose an Expert level API for runtime introspection of the 
FieldCache to provide info about what is in the FieldCache at any given moment. 
 We should also provide utility methods for sanity checking that the FieldCache 
doesn't contain anything odd...
   * entries for the same reader/field with different types/parsers
   * entries for the same field/type/parser in a reader and its subreader(s)
   * etc...




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732110#action_12732110
 ] 

Hoss Man commented on LUCENE-1749:
--


The motivation for this issue is all of the changes coming in 2.9 in how Lucene 
internally uses the FieldCache API -- the biggest change being per-segment 
sorting, but there may be others not immediately obvious.

While these changes are backwards compatible from an API and functionality 
perspective, they could have some pretty serious performance impacts for 
existing apps that also use the FieldCache directly: after upgrading, the 
apps suddenly seem slower to start (because of redundant FieldCache 
initialization) and require 2X as much RAM as they did before.  This could lead 
people to assume Lucene has suddenly become a major memory hog.  
SOLR- and SOLR-1247 are some quick examples of the types of problems that 
apps could encounter.

Currently the only way for a user to even notice the problem is to do memory 
profiling, and the FieldCache data structure isn't the easiest to understand.  
It would be a lot nicer to have some methods for doing this inspection 
programmatically, so users could write automated tests for incorrect/redundant 
usage.
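
To sketch the kind of automated check I have in mind (all names below are
invented for illustration, not necessarily what the patch calls them):

  CacheEntry[] entries = FieldCache.DEFAULT.getCacheEntries(); // hypothetical API
  for (CacheEntry e : entries) {
    // a sanity test could assert one entry per reader/field, and no entry
    // duplicated between a reader and its subreaders
    System.out.println(e.getReaderKey() + "/" + e.getFieldName()
        + " type=" + e.getCacheType());
  }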

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor

 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1749:
-

Attachment: fieldcache-introspection.patch

Here's the start of a patch to provide this functionality -- it just provides a 
new method/datastructure for inspecting the cache; the sanity checking utility 
methods should be straightforward assuming people think this is a good idea.

The new method itself is fairly simple, but quite a bit of refactoring to how 
the caches are managed was necessary to make it possible to implement the 
method sanely.  These changes to the FieldCache internals seem like they are 
generally a good idea from a maintenance standpoint even if people don't like 
the new method.

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Attachments: fieldcache-introspection.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1749:
-

Lucene Fields: [New, Patch Available]  (was: [New])
Fix Version/s: 2.9

Technically this isn't a bug, so I probably shouldn't add it to the 2.9 blocker 
list, but I really think it would be a good idea to have something like this in 
the 2.9 release.

At the very least: I'd like to put it on the list until/unless there is 
consensus that it's not needed.


 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732123#action_12732123
 ] 

Mark Miller commented on LUCENE-1749:
-

nice - would be great if it could estimate ram usage as well.

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1742) Wrap SegmentInfos in public class

2009-07-16 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1742:
-

Attachment: LUCENE-1742.patch

* Reader pool isn't public anymore

* Left methods of reader as public (could roll back?)

* I'd rather that ReaderPool be public; however, since it's new I
guess we don't want people relying on it?

* All tests pass

* It would be great to get this into 2.9

 Wrap SegmentInfos in public class 
 --

 Key: LUCENE-1742
 URL: https://issues.apache.org/jira/browse/LUCENE-1742
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1742.patch, LUCENE-1742.patch, LUCENE-1742.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 Wrap SegmentInfos in a public class so that subclasses of MergePolicy do not 
 need to be in the org.apache.lucene.index package.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732157#action_12732157
 ] 

Michael McCandless commented on LUCENE-1749:


+1 -- this'd be great to get into 2.9.

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732166#action_12732166
 ] 

Uwe Schindler commented on LUCENE-1749:
---

Looks good as a start; one question about a comment:

What do you mean by:
 * :TODO: is the int sort type still needed? ... doesn't seem to be used 
anywhere, code just tests custom for SortComparator vs Parser.

I do not understand: do you want to remove the IntCache? What is different 
about it in comparison with the other ones?

Uwe

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732190#action_12732190
 ] 

Hoss Man commented on LUCENE-1749:
--

bq. :TODO: is the int sort type still needed? ... doesn't seem to be used 
anywhere, code just tests custom for SortComparator vs Parser.

Sorry ... badly placed quotes ... that was in reference to Entry.type. 

Until I changed getStrings, getStringIndex, and getAuto to construct Entry 
objects as part of my refactoring, the type attribute (and the constructor 
that takes a type argument) didn't seem to be used anywhere (as far as I could 
tell).

My guess: maybe some previous change refactored the logic that switched on 
type up into the SortFields, so the FieldCache no longer needs to care about it?


 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1750) LogByteSizeMergePolicy doesn't keep segments under maxMergeMB

2009-07-16 Thread Jason Rutherglen (JIRA)
LogByteSizeMergePolicy doesn't keep segments under maxMergeMB
-

 Key: LUCENE-1750
 URL: https://issues.apache.org/jira/browse/LUCENE-1750
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9


Basically I'm trying to create largish 2-4GB shards using
LogByteSizeMergePolicy; however, the attached unit test shows
segments that exceed maxMergeMB.

The goal is for segments to be merged up to 2GB, then for all
merging into that segment to stop, and for another 2GB segment to be
started. This helps when replicating in Solr, where if a single
optimized 60GB segment is created, the machine stops working due
to IO and CPU starvation. 
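
Roughly the configuration in question (2.4-era API; the exact size is
illustrative):

LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
mp.setMaxMergeMB(2048);     // intent: stop merging once a segment reaches ~2GB
writer.setMergePolicy(mp);  // writer is an existing IndexWriter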

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1750) LogByteSizeMergePolicy doesn't keep segments under maxMergeMB

2009-07-16 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1750:
-

Attachment: LUCENE-1750.patch

Unit test illustrating the issue.

 LogByteSizeMergePolicy doesn't keep segments under maxMergeMB
 -

 Key: LUCENE-1750
 URL: https://issues.apache.org/jira/browse/LUCENE-1750
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1750.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 Basically I'm trying to create largish 2-4GB shards using
 LogByteSizeMergePolicy; however, the attached unit test shows
 segments that exceed maxMergeMB.
 The goal is for segments to be merged up to 2GB, then for all
 merging into that segment to stop, and for another 2GB segment to be
 started. This helps when replicating in Solr, where if a single
 optimized 60GB segment is created, the machine stops working due
 to IO and CPU starvation. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1749:


Attachment: LUCENE-1749.patch

Here is a start towards estimating the FieldCache RAM usage.

It probably works fairly well, though it will be limited by stack space on a 
very heavily nested object graph.

I've added the size guess for getValue in the introspection output.

It's a start, anyway.

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...
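
As a rough sketch of what the first check above could look like - the
CacheEntry shape and accessor names here are guesses at such an API for
illustration, not confirmed method names:

import java.util.HashMap;
import java.util.Map;

public class FieldCacheSanity {
  // Assumed shape of an introspection entry; names are hypothetical.
  public interface CacheEntry {
    Object getReaderKey();
    String getFieldName();
    Class<?> getCacheType();
  }

  // Flag entries where the same reader/field was populated with a
  // different type - the first "odd" case listed above.
  public static void checkDuplicateTypes(CacheEntry[] entries) {
    Map<String,CacheEntry> seen = new HashMap<String,CacheEntry>();
    for (CacheEntry e : entries) {
      String key = e.getReaderKey() + "/" + e.getFieldName();
      CacheEntry prev = seen.put(key, e);
      if (prev != null && !prev.getCacheType().equals(e.getCacheType())) {
        System.err.println("Suspicious: " + key + " cached as both "
            + prev.getCacheType() + " and " + e.getCacheType());
      }
    }
  }
}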




[jira] Commented: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732297#action_12732297
 ] 

Mark Miller commented on LUCENE-1749:
-

We would probably want to provide an alternate toString that includes the RAM
guess and a default that skips it - I haven't tested performance, but it might
take a while to check a gigantic String array.
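
Something like the following shape, as a sketch only - the abstract
accessors and the RamUsageGuess helper (from the sketch in the previous
message) are assumed names for illustration, not the patch's API:

public abstract class IntrospectionEntry {
  public abstract String getFieldName();
  public abstract Class<?> getCacheType();
  public abstract Object getValue();

  // Default form stays cheap by skipping the estimate.
  public String toString() {
    return toString(false);
  }

  // Alternate form includes the RAM guess, which can be slow when the
  // cached value is a gigantic String array.
  public String toString(boolean includeSizeGuess) {
    StringBuilder sb = new StringBuilder();
    sb.append(getFieldName()).append(',').append(getCacheType().getName());
    if (includeSizeGuess) {
      sb.append(",~").append(RamUsageGuess.sizeOf(getValue())).append(" bytes");
    }
    return sb.toString();
  }
}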

 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...




[jira] Issue Comment Edited: (LUCENE-1749) FieldCache introspection API

2009-07-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12732297#action_12732297
 ] 

Mark Miller edited comment on LUCENE-1749 at 7/16/09 6:35 PM:
--

We would probably want to provide an alternate toString that includes the RAM
guess and a default that skips it - I haven't tested performance, but it might
take a while to check a gigantic String array.

Also, JavaImpl should probably be renamed to JavaMemoryModel or MemoryModel.
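
Presumably the renamed abstraction is a small interface over VM-specific
sizes, something like this hypothetical shape:

public interface MemoryModel {
  int getReferenceSize();     // bytes per object reference
  int getObjectHeaderSize();  // bytes of per-object header overhead
  int getArrayHeaderSize();   // bytes of per-array header overhead
  int getPrimitiveSize(Class<?> type);
}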

  was (Author: markrmil...@gmail.com):
We would probably want to provide an alternate toString that includes the RAM
guess and a default that skips it - I haven't tested performance, but it might
take a while to check a gigantic String array.
  
 FieldCache introspection API
 

 Key: LUCENE-1749
 URL: https://issues.apache.org/jira/browse/LUCENE-1749
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Priority: Minor
 Fix For: 2.9

 Attachments: fieldcache-introspection.patch, LUCENE-1749.patch


 FieldCache should expose an Expert level API for runtime introspection of the 
 FieldCache to provide info about what is in the FieldCache at any given 
 moment.  We should also provide utility methods for sanity checking that the 
 FieldCache doesn't contain anything odd...
* entries for the same reader/field with different types/parsers
* entries for the same field/type/parser in a reader and its subreader(s)
* etc...
