[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-15 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857375#action_12857375
 ] 

Tim Smith commented on LUCENE-2324:
---

bq. But... could we allow an add/updateDocument call to express this affinity, 
explicitly?

I would love to be able to explicitly define a segment affinity for the documents 
I'm feeding.

This would then allow me to say:
all docs from table A have affinity 1
all docs from table B have affinity 2

Ideally this would result in documents from each table being indexed into 
different segments. (Obviously, I would then also need segment merging to be 
affinity-aware, so optimize/merging would only merge segments that share an 
affinity; see the sketch below.)
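
For illustration only, here is a rough sketch of the kind of affinity hook I have in 
mind; none of these types or methods exist in Lucene, and the names are hypothetical:

{code}
// Hypothetical sketch of the affinity idea above -- not existing Lucene API.
// Each add carries an affinity key, and merging would be restricted to
// segments that share a key.
public interface AffinityWriter {
  /** Index the document into the RAM segment associated with this affinity key. */
  void addDocument(org.apache.lucene.document.Document doc, int affinity)
      throws java.io.IOException;

  /** Affinity-aware merge check: only segments with the same key may be merged. */
  boolean canMerge(int segmentAffinityA, int segmentAffinityB);
}
{code}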

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1

 Attachments: lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-15 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857385#action_12857385
 ] 

Tim Smith commented on LUCENE-2324:
---

bq. Probably if you really want to keep the segments segregated like that, you 
should in fact index to separate indices?

That's what I'm currently thinking I'll have to do.

However, it would be ideal if I could either subclass IndexWriter or use 
IndexWriter directly with this affinity concept (potentially writing my own 
affinity-aware segment merger).
That would let me easily use near-real-time indexing, since only one IndexWriter 
would be in the mix, and would make managing deletes and a whole host of other 
issues with multiple indexes disappear.
It would also let me configure memory settings across all affinity groups, 
instead of having to dynamically create multiple writers, each with its own 
memory bounds.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1

 Attachments: lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders

2010-03-30 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851388#action_12851388
 ] 

Tim Smith commented on LUCENE-2071:
---

+1

I have a special subclassed IndexSearcher that certain special queries require, 
so IndexWriter's delete-by-query will fail for those queries, since a plain 
IndexSearcher is used internally.

With this added method, I would be able to construct my own Searcher over the 
readers and then apply the deletes properly.

This would also allow counting the deletes as they occur (which is commonly 
desired when deleting by query).

It would be nice if this method also worked with non-pooled readers,
so my desired method signature would be:
void updateReaders(Readers callback, boolean pooled)

If the readers were already pooled, the boolean would have no effect; otherwise the 
segment readers would be opened just like the non-pooled readers used for applying 
deletes (see the sketch below).
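
To make that concrete, here is a rough sketch of what such a callback could look like; 
the names are illustrative, not the actual API in the attached patch:

{code}
// Hedged sketch only -- illustrative names, not the patch's exact API.
// IndexWriter would hand its segment readers to call(); returning true signals
// that deletes/norm updates were applied, so the next commit isn't skipped.
public interface ReadersCallback {
  boolean call(org.apache.lucene.index.SegmentReader[] readers) throws java.io.IOException;
}

// Desired entry point (hypothetical): if pooled is false, the writer would open
// the segment readers the same way it does for applying deletes, run the
// callback, and then release them.
// void updateReaders(ReadersCallback callback, boolean pooled)
{code}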

 Allow updating of IndexWriter SegmentReaders
 

 Key: LUCENE-2071
 URL: https://issues.apache.org/jira/browse/LUCENE-2071
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2071.patch


 This discussion kind of started in LUCENE-2047.  Basically, we'll allow users 
 to perform delete document, and norms updates on SegmentReaders that are 
 handled by IndexWriter.




[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders

2010-03-30 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851528#action_12851528
 ] 

Tim Smith commented on LUCENE-2071:
---

Found a couple of small issues with the patch attached to this ticket:

1. applyDeletes issue

I saw this was mentioned in another ticket.

I think the flush should be flush(true, true, false),
and applyDeletes() should be called inside the synchronized block.


2. IndexWriter.changeCount not updated

The call() method does not return a boolean indicating whether there were any 
changes that would need to be committed.

As a result, if no other changes are made to the IndexWriter, the commit will 
be skipped even though deletes/norm updates were sent in, and 
IndexReader.reopen() will then return the old reader without the deletes/norms.



 Allow updating of IndexWriter SegmentReaders
 

 Key: LUCENE-2071
 URL: https://issues.apache.org/jira/browse/LUCENE-2071
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-2071.patch


 This discussion kind of started in LUCENE-2047.  Basically, we'll allow users 
 to perform delete document, and norms updates on SegmentReaders that are 
 handled by IndexWriter.




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850127#action_12850127
 ] 

Tim Smith commented on LUCENE-2345:
---

bq. I think we should only commit this only on 3.1 (new feature)? 

3.1 only, of course (I just posted a 3.0 patch now because that's what I'm using and 
I need the functionality now).

bq. Tim, do you think the plugin model (extension by composition) would be 
workable for your use case? Ie, instead of a factory enabling subclasses of 
SegmentReader?

As long as the plugin model allows the same capabilities, that could work just 
fine and could be the final solution for this ticket.

I mainly need the ability to add data structures to a SegmentReader that will 
be shared by all SegmentReaders for a segment, and then add some extra meta 
information on a per-instance basis.

Is there a ticket or wiki page that details the plugin architecture/design so 
I could take a look?

However, would the plugins allow overriding specific IndexReader methods?

I still see the need to be able to override specific methods on a 
SegmentReader (in order to track statistics or provide 
changed/different/faster/more feature-rich implementations).
I don't have a direct need for this right now, but I could envision needing it 
in the future.

Here are a few requirements I would pose for the plugin model (maybe they have already 
been thought of); a minimal sketch of such an interface follows the list:
* Plugins have hooks to reopen themselves (some plugins can be shared across 
all instances of a SegmentReader)
** These reopen hooks would be called during SegmentReader.reopen()
* Plugins are initialized during SegmentReader.get()/SegmentReader.reopen()
** Plugins should not have to be added after the fact, as this would not allow 
proper warming/initialization of plugins inside NRT indexing
** I assume this would need to be added as some list of PluginFactories passed to 
IndexWriter/IndexReader.open()?
* Plugins should have a close() method that is called in SegmentReader.close()
** This will allow proper release of any resources
* Plugins are passed the instance of the SegmentReader they belong to
** Plugins should be able to access all methods on the SegmentReader
** This would effectively allow overriding a SegmentReader by having a plugin 
provide the functionality instead (however, only callers explicitly using the 
plugin would get this benefit)
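
Here is a minimal sketch of what such a plugin contract might look like, purely to 
illustrate the requirements above (SegmentPlugin is a hypothetical name, not an 
existing Lucene API):

{code}
// Minimal illustrative sketch of the requirements listed above; hypothetical API.
public interface SegmentPlugin {
  /** Called from SegmentReader.get()/reopen() with the owning reader instance. */
  void init(org.apache.lucene.index.SegmentReader reader) throws java.io.IOException;

  /** Called during SegmentReader.reopen() so shared structures can be refreshed. */
  SegmentPlugin reopen(org.apache.lucene.index.SegmentReader newReader) throws java.io.IOException;

  /** Called from SegmentReader.close() to release any resources. */
  void close() throws java.io.IOException;
}
{code}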





 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-26 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-2345:
--

Attachment: LUCENE-2345_3.0.plugins.patch

Here's a patch (again, against 3.0) showing the minimal API I would like to see 
from the plugin model.

 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850323#action_12850323
 ] 

Tim Smith commented on LUCENE-2345:
---

Found one issue with the plugins patch.

With NRT indexing, if the SegmentReader is opened with no TermInfosReader (for 
merging), then the plugins will be initialized with a SegmentReader that has no 
ability to walk the terms enum.

I guess SegmentPlugin initialization should wait until after the terms index is 
loaded, or another method for catching this event should be added to the 
SegmentPlugin interface (something like the hook sketched below).
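
For example, the extra hook might look something like this (hypothetical, just to 
illustrate the suggestion; it is not part of the attached patch):

{code}
// Hypothetical extra hook for the SegmentPlugin idea -- not an existing API.
public interface TermsIndexAware {
  /**
   * Called once the terms index has been loaded, e.g. when a reader that was
   * first opened for merging later becomes searchable.
   */
  void termsIndexLoaded(org.apache.lucene.index.SegmentReader reader) throws java.io.IOException;
}
{code}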


 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850361#action_12850361
 ] 

Tim Smith commented on LUCENE-2345:
---

bq. My patch removes loadTermsIndex method from SegmentReader and requires you 
to reopen it. 

That's definitely much cleaner and would solve the issue in my current patch 
(sadly I'm on 3.0 and want to keep my patch there to a minimum until I can port 
to all the goodness in 3.1).

bq. Also, they extend not only SegmentReader, but the whole hierarchy - SR, MR, 
DR, whatever.

I just wussed out and did only the SegmentReader case, as that's all I need 
right now.

bq. as all the hooks are on the factory classes

Could you post your factory class interface?
If I base my 3.0 patch off that, I can reduce my 3.1 porting overhead.


Are there any tickets tracking your reopen refactoring or your plugin model?
If not, feel free to retool this ticket for your plugin model for IndexReaders, 
as that will solve my use cases (and then some).

 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-2345:
--

Attachment: LUCENE-2345_3.0.patch

Here's a patch against 3.0 that provides the SegmentReaderFactory ability
(not tested yet, but I'll be doing that shortly as I integrate this 
functionality).

It adds a SegmentReaderFactory.

IndexWriter now has a getter and setter for it.

SegmentReader has a new protected method init(), which is called after the 
segment reader has been initialized (to allow subclasses to hook this event 
and do additional initialization, etc.).

It also adds 2 new IndexReader.open() overloads that allow specifying the 
SegmentReaderFactory; a hedged usage sketch is below.


 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849731#action_12849731
 ] 

Tim Smith commented on LUCENE-2345:
---

That was my plan.

 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Created: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-24 Thread Tim Smith (JIRA)
Make it possible to subclass SegmentReader
--

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1


I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* to attach custom objects to an instance of a segment reader (caches, statistics, 
and so on)
* to override methods on SegmentReader as needed

Currently this isn't really possible.

I propose adding a SegmentReaderFactory that would allow creating custom 
subclasses of SegmentReader.

The default implementation would be something like:
{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return get(readOnly); // pseudocode: return a fresh reader of the right flavor
  }
}
{code}

It would then be possible to pass a SegmentReaderFactory to IndexWriter 
(for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
etc.).

I could prepare a patch if others think this has merit.

Obviously, this API would be experimental/advanced and will change in the future.







[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2010-03-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849358#action_12849358
 ] 

Tim Smith commented on LUCENE-1821:
---

This would actually be solved by LUCENE-2345 for me, as I would then be able to 
tag SegmentReaders with any additional accounting information I need.

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per-segment basis, there is no way for a 
 Scorer to know the actual doc id of the documents it matches (only the 
 relative doc offset into the segment).
 If you use caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer, because the scorer is not passed the offset needed to calculate the 
 real docid.
 I suggest having the Weight.scorer() method also take an integer for the doc offset.
 The abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset.
 All Weights that have sub-weights must pass this offset down to created 
 sub-weights.
 Details on the workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add an int getIndexReaderBase(IndexReader) method to your subclass
 * During Weight creation, the Weight must hold onto a reference to the passed-in 
 Searcher (cast to your subclass)
 * During Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * The Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation is possible if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getIndexReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers, getIndexReader()); // collect the segment readers
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround means you cannot serialize your custom Weight 
 implementation




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849455#action_12849455
 ] 

Tim Smith commented on LUCENE-2345:
---

That's the reassurance I needed :)

I'll start working on a patch tomorrow.
It will take a few days, as I'll start with a 3.0 patch (which is what I use), and 
then create a 3.1 patch once I've got that all fleshed out.

 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849497#action_12849497
 ] 

Tim Smith commented on LUCENE-2345:
---

I'll do my initial work on 3.0 so I can absorb the changes now, and will post 
that patch.

At which point, I can wait for you to finish whatever you need, or we can just 
incorporate the same ability into your patch for the other ticket.
I would just like to see the ability to subclass SegmentReaders in 3.1, so I 
don't have to port a patch when I absorb 3.1 (I can just use the finalized APIs).



 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
 return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
 return newSegmentReader(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844930#action_12844930
 ] 

Tim Smith commented on LUCENE-2310:
---

Personally, I like keeping Fieldable (or having AbstractField with just 
abstract methods and no actual implementation).

For feeding documents, I use custom Fieldable implementations to reduce the number 
of setters called, as fields of different types have different constant settings 
(a hedged illustration is below).
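
For example (a hypothetical class, not from any patch), a Field subclass can bake its 
constant settings into the constructor so feeding code never touches the setters per 
document:

{code}
import org.apache.lucene.document.Field;

// Hypothetical example of the pattern described above: the field type's constant
// settings live in the constructor instead of per-document setter calls.
public class KeywordField extends Field {
  public KeywordField(String name, String value) {
    super(name, value, Store.YES, Index.NOT_ANALYZED_NO_NORMS);
    setOmitTermFreqAndPositions(true); // constant for this field type
  }
}
{code}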

 Reduce Fieldable, AbstractField and Field complexity
 

 Key: LUCENE-2310
 URL: https://issues.apache.org/jira/browse/LUCENE-2310
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Index
Reporter: Chris Male
 Attachments: LUCENE-2310-Deprecate-AbstractField.patch


 In order to move field type like functionality into its own class, we really 
 need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
 Currently AbstractField depends on Field, and does not provide much more 
 functionality than storing fields, most of which are being moved over to 
 FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
 possible Fieldable), moving much of the functionality into Field and 
 FieldType.




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-03-01 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12839682#action_12839682
 ] 

Tim Smith commented on LUCENE-2283:
---

I haven't been able to fully replicate this issue in a unit test scenario;

however, it will definitely resolve the 40M of RAM that was allocated and never 
released for the RAMFiles on the StoredFieldsWriter (keeping that bounded by the 
configured memory size).

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2283.patch, LUCENE-2283.patch, LUCENE-2283.patch


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838976#action_12838976
 ] 

Tim Smith commented on LUCENE-2283:
---

I'll work up another patch.

It might take me a few minutes to get my head wrapped around the 
TermVectorsTermsWriter stuff.

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2283.patch


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-26 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-2283:
--

Attachment: LUCENE-2283.patch

Here's a new patch with your suggestions

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2283.patch, LUCENE-2283.patch


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-25 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-2283:
--

Attachment: LUCENE-2283.patch

Here's a patch for using a pool for stored fields buffers


 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2283.patch


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837793#action_12837793
 ] 

Tim Smith commented on LUCENE-2283:
---

I came across this issue while looking into a reported memory leak during indexing.

A YourKit snapshot showed that the PerDocs for an IndexWriter were using ~40M 
of memory (at which point I came across this potentially unbounded memory use 
in StoredFieldsWriter).
This snapshot seems to be more or less at a stable point (memory grows but then 
returns to a normal state); however, I have reports that eventually the memory 
is completely exhausted, resulting in out-of-memory errors.

So far I have not found any other major culprit in the Lucene indexing code.

This index receives a routine mix of very large and very small documents (which 
would explain this situation).
The VM and system have a more than ample amount of memory given the buffer size 
and what should be normal indexing RAM requirements.

Also, a major difference between this leak not occurring and it showing up is 
that previously the IndexWriter was closed when performing commits; now the 
IndexWriter remains open (just calling IndexWriter.commit()). So, if any memory 
is leaking during indexing, it is no longer being reclaimed during commit. As a 
side note, closing the index writer at commit time would sometimes fail, 
causing some subsequent updates to fail because the index writer was locked 
and couldn't be reopened until the old index writer was garbage collected, so I 
don't want to go back to that for commits.

It's possible there is a leak somewhere else (I currently do not have a snapshot 
from right before the out-of-memory errors occur, so currently the only thing that 
stands out is the PerDoc memory use).

As far as a fix goes, wouldn't it be better to have the RAMFiles used for 
stored fields pull and return byte buffers from the byte block pool on the 
DocumentsWriter? This would allow the memory to be reclaimed based on the index 
writer's buffer size (otherwise there is no configurable way to tune this memory 
use). A sketch of the general idea is below.
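
To illustrate the bounding idea (this is only a sketch of the concept, not Lucene's 
DocumentsWriter/ByteBlockPool code), something along these lines keeps rare huge 
documents from pinning memory in a reuse pool:

{code}
import java.util.ArrayDeque;

// Sketch of the idea only -- not Lucene internals. A reuse pool with a cap, so
// buffers are recycled up to a limit and anything beyond that becomes garbage.
final class BoundedBufferPool {
  private final ArrayDeque<byte[]> free = new ArrayDeque<byte[]>();
  private final int bufferSize;
  private final int maxCachedBuffers;

  BoundedBufferPool(int bufferSize, int maxCachedBuffers) {
    this.bufferSize = bufferSize;
    this.maxCachedBuffers = maxCachedBuffers;
  }

  synchronized byte[] acquire() {
    byte[] b = free.poll();
    return b != null ? b : new byte[bufferSize];
  }

  synchronized void release(byte[] buffer) {
    if (free.size() < maxCachedBuffers) {
      free.push(buffer); // recycle up to the cap
    }
    // otherwise drop the buffer so its memory can be reclaimed
  }
}
{code}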



 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837821#action_12837821
 ] 

Tim Smith commented on LUCENE-2283:
---

ramBufferSizeMB is 64 MB.

Here's the YourKit breakdown per class:
* DocumentsWriter - 256 MB
** TermsHash - 38.7 MB
** StoredFieldsWriter - 37.5 MB
** DocumentsWriterThreadState - 36.2 MB
** DocumentsWriterThreadState - 34.6 MB
** DocumentsWriterThreadState - 33.8 MB
** DocumentsWriterThreadState - 27.5 MB
** DocumentsWriterThreadState - 13.4 MB

I'm starting to dig into the ThreadStates now to see if anything stands out there.

bq. Hmm, that makes me nervous, because I think in this case the use should be 
bounded.

I should be getting a new profile dump at crash time soon, so hopefully that 
will make things clearer.

bq. That doesn't sound good! Can you post some details on this (eg an 
exception)?

If I recall correctly, the exception was caused by an out-of-disk-space 
situation (which would recover).
Obviously, not much can be done about that other than adding more disk 
space; the situation would recover, but docs would be lost in the interim.

bq. But, anyway, keeping the same IW open and just calling commit is (should 
be) fine.

Yeah, this should be the way to go, especially as it means the pooled 
buffers don't need to be reallocated/reclaimed/etc.; however, right now this is 
the only change I can currently think of that could result in memory issues.

bq. Yes, that's a great solution - a single pool. But that's a somewhat bigger 
change. 

It seems like this would be the best approach, as it makes the memory bounded by 
the configuration of the engine, giving better reuse of byte blocks and a better 
ability to reclaim memory (in DocumentsWriter.balanceRAM()).




 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837875#action_12837875
 ] 

Tim Smith commented on LUCENE-2283:
---

bq. I agree. I'll mull over how to do it... unless you're planning on consing 
up a patch 

I'd love to, but I don't have the free cycles at the moment :(

bq. How many threads do you pass through IW?

I honestly don't 100% know the origin of the threads I'm given.
In general, they should come from a static pool, but they may be dynamically allocated 
if the static pool runs out.

One thought I had recently was to control this more tightly by having a limited 
number of static threads call the IndexWriter methods, in case that was the 
issue (but that would be a pretty big change).

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837881#action_12837881
 ] 

Tim Smith commented on LUCENE-2283:
---

The latest profile dump has pointed to a non-Lucene issue as the cause of some memory 
growth,

so feel free to drop the priority.

However, it still seems like using the byte block pool for the stored fields would be 
a good change overall.

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837919#action_12837919
 ] 

Tim Smith commented on LUCENE-2283:
---

Another note: this was on a 64-bit VM.

I've noticed that all the memory-size calculations assume 4-byte pointers, so 
perhaps that can lead to more memory being used than would otherwise be 
expected (although 256 MB is still well over the 2x memory use that would 
be expected in that case).



 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12838017#action_12838017
 ] 

Tim Smith commented on LUCENE-2283:
---

I'm working up a patch for the shared byte block pool for stored field buffers 
(I found a few cycles).


 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




[jira] Created: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-02-23 Thread Tim Smith (JIRA)
Possible Memory Leak in StoredFieldsWriter
--

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith


StoredFieldsWriter creates a pool of PerDoc instances.

This pool will grow but is never reclaimed by any mechanism.

Furthermore, each PerDoc instance contains a RAMFile.
This RAMFile will also never be truncated (and will only ever grow), as far as 
I can tell.

When feeding documents with a large number of stored fields (or one large 
dominating stored field) this can result in memory being consumed in the 
RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
large, even if large documents are rare.

It seems like there should be some attempt to reclaim memory from the PerDoc[] 
instance pool (or to otherwise limit the size of the RAMFiles that are cached), etc.





[jira] Created: (LUCENE-2276) Add IndexReader.document(int, Document, FieldSelector)

2010-02-22 Thread Tim Smith (JIRA)
Add IndexReader.document(int, Document, FieldSelector)
--

 Key: LUCENE-2276
 URL: https://issues.apache.org/jira/browse/LUCENE-2276
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Tim Smith


The Document object passed in would be populated with the fields identified by 
the FieldSelector for the specified internal document id.

This method would allow reuse of Document objects when retrieving stored fields 
from the index (see the sketch below).
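
A sketch of how the wished-for overload might be used; the three-argument document() 
call is the proposal here and does not exist in Lucene today, while the rest uses 
existing API:

{code}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

class StoredFieldScan {
  static void scanTitles(IndexReader reader) throws IOException {
    MapFieldSelector selector = new MapFieldSelector(new String[] { "title" });
    Document doc = new Document();           // allocated once, reused per document
    for (int id = 0; id < reader.maxDoc(); id++) {
      if (reader.isDeleted(id)) continue;
      reader.document(id, doc, selector);    // the proposed overload (hypothetical)
      // ... consume doc's fields ...
    }
  }
}
{code}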






[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader

2009-12-15 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790803#action_12790803
 ] 

Tim Smith commented on LUCENE-1923:
---

added getName() in case anyone is currently relying on current (default) output 
from toString() on index readers

feel free to rename the getName() methods to toString()

 Add toString() or getName() method to IndexReader
 -

 Key: LUCENE-1923
 URL: https://issues.apache.org/jira/browse/LUCENE-1923
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
Assignee: Michael McCandless
 Attachments: LUCENE-1923.patch


 It would be very useful for debugging if IndexReader either had a getName() 
 method, or a toString() implementation that would get a string identification 
 for the reader.
 for SegmentReader, this would return the same as getSegmentName()
 for Directory readers, this would return the generation id?
 for MultiReader, this could return something like multi(sub reader name, sub 
 reader name, sub reader name, ...)
 right now, i have to check instanceof for SegmentReader, then call 
 getSegmentName(), and for all other IndexReader types, i would have to do 
 something like get the IndexCommit and get the generation off it (and this 
 may throw UnsupportedOperationException, at which point i would have to 
 recursively walk sub readers and try again)
 I could work up a patch if others like this idea

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader

2009-12-08 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787472#action_12787472
 ] 

Tim Smith commented on LUCENE-1923:
---

i won't have the time till after the new year.

if someone else wants to work up a patch, go for it (this seems simple enough 
and adds some nice info capabilities for logging/etc), otherwise, i'll get to 
it when i can

 Add toString() or getName() method to IndexReader
 -

 Key: LUCENE-1923
 URL: https://issues.apache.org/jira/browse/LUCENE-1923
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith

 It would be very useful for debugging if IndexReader either had a getName() 
 method, or a toString() implementation that would get a string identification 
 for the reader.
 for SegmentReader, this would return the same as getSegmentName()
 for Directory readers, this would return the generation id?
 for MultiReader, this could return something like multi(sub reader name, sub 
 reader name, sub reader name, ...)
 right now, i have to check instanceof for SegmentReader, then call 
 getSegmentName(), and for all other IndexReader types, i would have to do 
 something like get the IndexCommit and get the generation off it (and this 
 may throw UnsupportedOperationException, at which point i would have to 
 recursively walk sub readers and try again)
 I could work up a patch if others like this idea

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1923) Add toString() or getName() method to IndexReader

2009-12-08 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-1923:
--

Attachment: LUCENE-1923.patch

Here's a simple patch to get the ball rolling

This adds a getName() method to IndexReader

the default implementation will be:
SimpleClassName(subreader.getName(), subreader.getName(), ...)

SegmentReader will return same value as getSegmentName()

DirectoryReader will return:
DirectoryReader(segment_N, segment.getName(), segment.getName(), ...)

ParallelReader will return:
ParallelReader(parallelReader1.getName(), parallelReader2.getName(), ...)

this currently does not have a toString() implementation returning getName() 

do with this patch as you will
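
roughly, the default composite naming described above could look like the sketch below; this is just an illustration of the naming scheme, not the attached patch:

{code}
// Sketch of the described default: SimpleClassName(subreader.getName(), ...).
// Not the attached patch, just an illustration of the idea.
public String getName() {
  StringBuilder sb = new StringBuilder(getClass().getSimpleName()).append('(');
  IndexReader[] subs = getSequentialSubReaders();
  if (subs != null) {
    for (int i = 0; i < subs.length; i++) {
      if (i > 0) sb.append(", ");
      sb.append(subs[i].getName());
    }
  }
  return sb.append(')').toString();
}
{code}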


 Add toString() or getName() method to IndexReader
 -

 Key: LUCENE-1923
 URL: https://issues.apache.org/jira/browse/LUCENE-1923
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Attachments: LUCENE-1923.patch


 It would be very useful for debugging if IndexReader either had a getName() 
 method, or a toString() implementation that would get a string identification 
 for the reader.
 for SegmentReader, this would return the same as getSegmentName()
 for Directory readers, this would return the generation id?
 for MultiReader, this could return something like multi(sub reader name, sub 
 reader name, sub reader name, ...)
 right now, i have to check instanceof for SegmentReader, then call 
 getSegmentName(), and for all other IndexReader types, i would have to do 
 something like get the IndexCommit and get the generation off it (and this 
 may throw UnsupportedOperationException, at which point i would have to 
 recursively walk sub readers and try again)
 I could work up a patch if others like this idea

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-12-07 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786921#action_12786921
 ] 

Tim Smith commented on LUCENE-1859:
---

close if you like

application writers can add guards for this if they like/need to as a custom 
TokenFilter

mainly created this ticket as this can result in an unbounded buffer should 
people use the token stream api incorrectly (or against suggestions of lucene 
core developers)
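
for illustration, such a guard could be a custom TokenFilter that simply drops (or truncates) oversized tokens before they reach the indexer, similar in spirit to the stock LengthFilter; the filter name and limit below are made up:

{code}
// Hypothetical guard filter: drops tokens longer than a configured limit so the
// term buffer never grows without bound. Uses the 2.9+ attribute-based API.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class MaxTokenLengthFilter extends TokenFilter {
  private final int maxLength;
  private final TermAttribute termAtt;

  public MaxTokenLengthFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
    this.termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.termLength() <= maxLength) {
        return true; // keep tokens of acceptable size
      }
      // silently drop oversized tokens (could also truncate instead)
    }
    return false;
  }
}
{code}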

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-23 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781615#action_12781615
 ] 

Tim Smith commented on LUCENE-2086:
---

Got some performance numbers:

Description of test (NOTE: this is representative of actions that may occur in 
a running system (not a contrived test)):
* feed 4 million operations (3/4 are deletes, 1/4 are updates (single field))
* commit
* feed 1 million operations (about 1/3 are updates, 2/3 deletes (randomly 
selected))
* commit

Numbers:
|| Desc || Old || New ||
| feed 4 million | 56914ms | 15698ms |
| commit 4 million | 9072ms | 14291ms |
| total (4 million) | 65986ms | 29989ms | 
| update 1 million | 46096ms | 11340ms |
| commit 1 million | 13501ms | 9273ms | 
| total (1 million) | 59597ms | 20613ms |

This shows significant improvements with the patch applied (about 1/3 the time for the 1 
million feed, about 1/2 the time for the initial 4 million feed)

This means i'm definitely going to need to incorporate this patch while i'm still 
on 3.0 (will upgrade to 3.0 as soon as it's out, then apply this fix) 
Ideally, a 3.0.1 would be forthcoming in the next month or so with this fix so 
i wouldn't have to maintain this patched overlay of code
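
for context, the basic idea behind the patch (as i understand it) is just to apply the buffered delete terms in sorted order, so the term dictionary is only walked forward once per segment; a very rough sketch of that pattern (not the actual IndexWriter code, names are mine):

{code}
// Rough sketch of "resolve deletes in term sort order": sort the buffered delete
// terms, then walk them with a single forward-seeking TermDocs per reader.
// Illustration of the idea only, not the actual IndexWriter internals.
import java.util.TreeSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class SortedDeleteResolver {
  public static int applyDeletes(IndexReader reader, Iterable<Term> bufferedDeletes)
      throws Exception {
    // TreeSet keeps the delete terms in sorted order (Term is Comparable)
    TreeSet<Term> sorted = new TreeSet<Term>();
    for (Term t : bufferedDeletes) {
      sorted.add(t);
    }
    int deleted = 0;
    TermDocs docs = reader.termDocs();
    try {
      for (Term t : sorted) {
        docs.seek(t); // forward seeks, since terms arrive in sorted order
        while (docs.next()) {
          reader.deleteDocument(docs.doc());
          deleted++;
        }
      }
    } finally {
      docs.close();
    }
    return deleted;
  }
}
{code}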






 When resolving deletes, IW should resolve in term sort order
 

 Key: LUCENE-2086
 URL: https://issues.apache.org/jira/browse/LUCENE-2086
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2086.patch


 See java-dev thread IndexWriter.updateDocument performance improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780698#action_12780698
 ] 

Tim Smith commented on LUCENE-2086:
---

any chance this can go into 3.0.0 or a 3.0.1?


 When resolving deletes, IW should resolve in term sort order
 

 Key: LUCENE-2086
 URL: https://issues.apache.org/jira/browse/LUCENE-2086
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2086.patch


 See java-dev thread IndexWriter.updateDocument performance improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780701#action_12780701
 ] 

Tim Smith commented on LUCENE-2086:
---

i've seen the deletes dominating commit time quite often, so obviously it would 
be very useful to be able to absorb this optimization sooner rather than later (what's 
the timeframe for 3.1?)

otherwise i'll have to override the classes involved and pull in this patch 
(never liked this approach myself)

 When resolving deletes, IW should resolve in term sort order
 

 Key: LUCENE-2086
 URL: https://issues.apache.org/jira/browse/LUCENE-2086
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2086.patch


 See java-dev thread IndexWriter.updateDocument performance improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780710#action_12780710
 ] 

Tim Smith commented on LUCENE-2086:
---

bq. maybe try it & report back?

i'll see if i can find some cycles to try this against the most painful use 
case i have

bq. I'd rather see us release a 3.1 sooner rather than later, instead.

yes please.
I would definitely like to see a more accelerated release cycle (even if less 
functionality gets into each minor release)

 When resolving deletes, IW should resolve in term sort order
 

 Key: LUCENE-2086
 URL: https://issues.apache.org/jira/browse/LUCENE-2086
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2086.patch


 See java-dev thread IndexWriter.updateDocument performance improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public

2009-11-12 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777008#action_12777008
 ] 

Tim Smith commented on LUCENE-1909:
---

I have the following use case:

i have a configuration bean; this bean can be customized via xml at config time
in this bean, i expose the setting for the terms index divisor
so, my bean has to have a default value for this.

right now, i just use 1 for the default value.
would be nice if i could just use the lucene constant instead of using 1, as 
the lucene constant could change in the future (not really likely, but it's one 
less constant i have to maintain)

if the default is not made public i have 2 options:
# use a hard coded constant in my code for the default value (doing this right 
now)
# use an Integer object, and have null be the default

the nasty part about the second option is that i now have to do conditional 
opening of the reader depending on if null is the value (unset), when it would 
be much simpler (and easier for me to maintain), if i just always pass in that 
value
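
concretely, the bean side of this is trivial either way; the sketch below just shows the two options (the bean/field names are mine, not anything in lucene):

{code}
// Hypothetical config bean. Option 1 hard-codes a copy of Lucene's current default;
// option 2 would reuse IndexReader.DEFAULT_TERMS_INDEX_DIVISOR if it were public.
public class IndexConfigBean {
  // option 1: hard-coded copy of the default (what i do today)
  private int termInfosIndexDivisor = 1;

  // option 2 (wished for): inherit Lucene's default directly
  // private int termInfosIndexDivisor = IndexReader.DEFAULT_TERMS_INDEX_DIVISOR;

  public int getTermInfosIndexDivisor() {
    return termInfosIndexDivisor;
  }

  public void setTermInfosIndexDivisor(int divisor) {
    this.termInfosIndexDivisor = divisor; // populated from xml config
  }
}
{code}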


 Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
 ---

 Key: LUCENE-1909
 URL: https://issues.apache.org/jira/browse/LUCENE-1909
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE_1909.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public

2009-11-12 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777064#action_12777064
 ] 

Tim Smith commented on LUCENE-1909:
---

users can see the live setting via things like JMX/admin ui
also, if i intend users to actually change the value regularly, i can provide 
user facing documentation that would go into detail without the user needing to 
dig further into lucene internals (memory tuning guide or something)
currently just exposing this setting myself as a SUPER ADVANCED setting (just 
in case it will need to be tuned for custom use cases in the future) (can't 
tune it if its not exposed in config)



 Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
 ---

 Key: LUCENE-1909
 URL: https://issues.apache.org/jira/browse/LUCENE-1909
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE_1909.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public

2009-11-12 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777109#action_12777109
 ] 

Tim Smith commented on LUCENE-1909:
---

what you describe requires effectively 2 settings:
* custom term infos divisor enabled/disabled
* configured value if enabled

this then results in more complexity in opening the index reader (conditional 
opening where a non-conditional open with the configured divisor would do the 
trick)
any admin ui would also require more conditional handling of displaying this 
setting (as you described) (i'm not displaying it other than in JMX now anyway, 
so it doesn't really matter for me, and JMX just has a readonly attribute that 
shows the configured value (1 by default))

personally, i don't care too much if this constant is made public or not (would 
make it so i use that constant instead of defining my own with the same value), 
so it only saves me 1 line (and its not like the default will ever change from 
1 in the lucene code anyway)



 Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
 ---

 Key: LUCENE-1909
 URL: https://issues.apache.org/jira/browse/LUCENE-1909
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE_1909.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public

2009-11-12 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777118#action_12777118
 ] 

Tim Smith commented on LUCENE-1909:
---

Only thing i would want the constant for is to know what the default divisor 
is. The default just happens to be 1 (no divisor/off).

However (while unlikely) a new version of lucene could default to using a real 
divisor (maybe once everyone is on solid state disks, a higher divisor will 
result in the same speed of access, with less memory use), at which point, if i 
upgrade to a new version of lucene, i want to inherit that changed setting (as 
the default was selected by people that probably know better than me what will 
better serve the general use of lucene in terms of memory and performance)

right now, if i want to inherit the default i would have to do a conditional 
IndexReader.open() and store my setting as a pair (enabled/disabled, divisor), 
which could be encoded in an Integer object (null = disabled/use lucene default)
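
the conditional open i keep describing looks roughly like this; everything here is my app-layer sketch, assuming the open(Directory, boolean) and open(Directory, IndexDeletionPolicy, boolean, int) overloads, and assuming a null deletion policy falls back to the default policy:

{code}
// App-layer sketch of the conditional open. A null divisor means "not configured,
// use whatever Lucene's default is"; a non-null value is passed through explicitly.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class ReaderOpener {
  private final Integer configuredDivisor; // null = unset

  public ReaderOpener(Integer configuredDivisor) {
    this.configuredDivisor = configuredDivisor;
  }

  public IndexReader open(Directory dir) throws Exception {
    if (configuredDivisor == null) {
      return IndexReader.open(dir, true); // inherit Lucene's default divisor
    }
    // assumed: null deletion policy uses the default policy
    return IndexReader.open(dir, null, true, configuredDivisor.intValue());
  }
}
{code}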

if the constant is made public, it's easier for me to inherit that default 
setting.
of course at the end of the day, either approach will only be about 5 lines of 
code difference, so again, i don't really care too much about the outcome of 
this

bq. By the way, if you use a final constant, without recompiling it would never 
change,...

I never drop a new lucene in without recompiling (so that doesn't cause any 
difference for me)

 Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
 ---

 Key: LUCENE-1909
 URL: https://issues.apache.org/jira/browse/LUCENE-1909
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE_1909.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public

2009-11-12 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777136#action_12777136
 ] 

Tim Smith commented on LUCENE-1909:
---

bq. If you want to inherit the setting, use the correct constructor 

agreed, just a tiny bit more complexity on my side for that (but it's so 
insignificant that it doesn't really matter, and is really not even worth 
arguing over)

if the constant was public, i'd use it, if not, no worries (for me at least)

bq. By the default the feature is off. You can't inherit anything about it.

ideally, i want to inherit that the feature is off by default, then allow 
config to turn it on (by providing a value greater than one for this setting, 
or just 1 to allow config to explicitly disable)
using the constructor with no divisor does this (i just need to call the 
constructor conditionally depending on if the setting was explicitly 
configured), that's easy and is no problem to do at all (just a couple of extra 
lines of code in my app layer)



 Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
 ---

 Key: LUCENE-1909
 URL: https://issues.apache.org/jira/browse/LUCENE-1909
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Uwe Schindler
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE_1909.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1923) Add toString() or getName() method to IndexReader

2009-09-23 Thread Tim Smith (JIRA)
Add toString() or getName() method to IndexReader
-

 Key: LUCENE-1923
 URL: https://issues.apache.org/jira/browse/LUCENE-1923
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith


It would be very useful for debugging if IndexReader either had a getName() 
method, or a toString() implementation that would get a string identification 
for the reader.

for SegmentReader, this would return the same as getSegmentName()
for Directory readers, this would return the generation id?
for MultiReader, this could return something like multi(sub reader name, sub 
reader name, sub reader name, ...)

right now, i have to check instanceof for SegmentReader, then call 
getSegmentName(), and for all other IndexReader types, i would have to do 
something like get the IndexCommit and get the generation off it (and this may 
throw UnsupportedOperationException, at which point i would have to 
recursively walk sub readers and try again)

I could work up a patch if others like this idea

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader

2009-09-23 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758717#action_12758717
 ] 

Tim Smith commented on LUCENE-1923:
---

I'll work up a patch that will do the following:

add a getName() method to IndexReader (and all subclasses: SegmentReader, 
DirectoryReader, MultiReader, and any others i'm not currently aware of that i 
track down)

have toString() return indexreaderclassname(getName())

so, toString for a SegmentReader will look something like:
org.apache.lucene.index.SegmentReader(_ae)

for a DirectoryReader, it'll look like:
org.apache.lucene.index.DirectoryReader(segments_7)



 Add toString() or getName() method to IndexReader
 -

 Key: LUCENE-1923
 URL: https://issues.apache.org/jira/browse/LUCENE-1923
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith

 It would be very useful for debugging if IndexReader either had a getName() 
 method, or a toString() implementation that would get a string identification 
 for the reader.
 for SegmentReader, this would return the same as getSegmentName()
 for Directory readers, this would return the generation id?
 for MultiReader, this could return something like multi(sub reader name, sub 
 reader name, sub reader name, ...)
 right now, i have to check instanceof for SegmentReader, then call 
 getSegmentName(), and for all other IndexReader types, i would have to do 
 something like get the IndexCommit and get the generation off it (and this 
 may throw UnsupportedOperationException, at which point i would have to 
 recursively walk sub readers and try again)
 I could work up a patch if others like this idea

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-09-18 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757199#action_12757199
 ] 

Tim Smith commented on LUCENE-1821:
---

I've been playing with per-segment caches for the last couple of weeks and have 
got everything working pretty well

However, i have to end up doing a lot of mapping between an IndexReader 
instance, and the index into the IndexReader[] array of the IndexSearcher
this then allows me to easily get the proper document offset where needed, 
and/or get a handle on the proper per-segment cache/evaluation object/etc

For my use cases, it would be much easier if the following methods were 
available:

on Weight:
{code}
// readerId is the i in the "for (int i = 0; i < readers.length; ++i)" loop in IndexSearcher
// NOTE: readerId is at the IndexSearcher level, not the MultiSearcher level
public Scorer scorer(IndexReader reader, int readerId, boolean inOrder, boolean topLevel);
{code}

on Collector:
{code}
public void setNextReader(IndexReader reader, int docBase, int readerId);
// NOTE: this isn't strictly needed, as it's easier to get the readerId from
// docBase (using a cached int[] of doc bases for the searcher)
{code}

I suppose i could use the fact that these methods will always be called in 
order, keeping and incrementing a counter; however, the javadoc explicitly says 
that these methods may be called out of segment order to be more efficient in 
the future. It would therefore be very useful if these indexes were passed into 
these methods.

To work around this, my searcher currently has a getReaderIdForReader() method 
very similar to my earlier proposed getIndexReaderBase() method
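
roughly, that workaround amounts to the sketch below: map each sub reader to its position in the searcher's reader array, built once at construction time (the IdentityHashMap caching is my own choice here, not anything lucene provides):

{code}
// Sketch of a reader -> readerId map built from the IndexSearcher's sub readers.
import java.util.IdentityHashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

public class ReaderIdMap {
  private final Map<IndexReader, Integer> idByReader =
      new IdentityHashMap<IndexReader, Integer>();

  public ReaderIdMap(IndexReader[] subReaders) {
    for (int i = 0; i < subReaders.length; i++) {
      idByReader.put(subReaders[i], Integer.valueOf(i));
    }
  }

  // returns -1 if the reader is not one of the searcher's sub readers
  public int getReaderIdForReader(IndexReader reader) {
    Integer id = idByReader.get(reader);
    return id == null ? -1 : id.intValue();
  }
}
{code}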




 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: more efficient implementation can be done if you cache the result if 
 gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1915) Add static openInput(File,...) methods to all FSDirectory implementations

2009-09-17 Thread Tim Smith (JIRA)
Add static openInput(File,...) methods to all FSDirectory implementations
-

 Key: LUCENE-1915
 URL: https://issues.apache.org/jira/browse/LUCENE-1915
 Project: Lucene - Java
  Issue Type: Wish
  Components: Store
Reporter: Tim Smith


It would be really useful if NIOFSDirectory and MMapDirectory had static 
methods for opening an input for arbitrary Files
SimpleFSDirectory should likewise have a static openInput(File) method in order 
to cover all bases (right now, SimpleFSIndexInput only has protected access)

This allows creating a custom FSDirectory implementation that can use any 
criteria desired to determine what Input implementation to use for opening a 
file.

I know the FileSwitchDirectory provides some ability to do this, however that 
locks the selection criteria down to only the file extension in use
also, the FileSwitchDirectory approach seems to want to have each directory at 
different paths (as list() methods just concatenate the directory listings of the sub 
directories, which could cause havoc if both sub directories point to the same 
FS path?)

opening up these static openInput() methods would allow creating a custom FS 
store implementation that could, for instance, mmap files of a particular 
type and size and use NIO for other files, and maybe even use the SimpleFS 
input for a third category of files. Could also then apply different buffer 
sizes to different files, perform RAM caching of particular inputs, etc
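
a minimal sketch of the selection idea using only what exists today (assuming the File-based MMapDirectory/NIOFSDirectory constructors from 2.9): keep two directories on the same path and choose per file which one opens the input. a real solution would wrap this in a Directory subclass; the static openInput(File) methods wished for above would make that wrapper unnecessary.

{code}
// Sketch only: choose per file which underlying directory opens the input.
// The extensions and size criteria are arbitrary examples.
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class SelectiveInputOpener {
  private final Directory mmap;
  private final Directory nio;

  public SelectiveInputOpener(File path) throws Exception {
    this.mmap = new MMapDirectory(path);
    this.nio = new NIOFSDirectory(path);
  }

  public IndexInput open(String fileName) throws Exception {
    // e.g. mmap the term dictionary / stored fields, use NIO for everything else
    if (fileName.endsWith(".tis") || fileName.endsWith(".fdt")) {
      return mmap.openInput(fileName);
    }
    return nio.openInput(fileName);
  }
}
{code}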


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748071#action_12748071
 ] 

Tim Smith commented on LUCENE-1859:
---

bq. The worst-case scenario seems kind of theoretical
100% agree, but even if one extremely large token gets added to the stream (and 
possibly dropped prior to indexing), the char[] grows without ever shrinking 
back (so it can result in memory usage growing if bad content is thrown in, 
and people have no shortage of bad content)

bq. Is a priority of major justified?

major is just the default priority (feel free to change)

bq. I assume that, based on this report, TermAttributeImpl never gets reset or 
discarded/recreated over the course of an indexing session?
using reusable TokenStream will never cause the buffer to be nulled (as far as 
i can tell) for the lifetime of the thread (please correct me if i'm wrong on 
this)


i would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing 
this to be statically updated), just as a means to bound the max memory used 
here
currently, the memory use is bounded by Integer.MAX_VALUE (which is really big)
If someone feeds a large text document with no spaces or other delimiting 
characters, a non-intelligent tokenizer would treat this as one big token (and 
grow the char[] accordingly)

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748071#action_12748071
 ] 

Tim Smith edited comment on LUCENE-1859 at 8/26/09 11:31 AM:
-

bq. The worst-case scenario seems kind of theoretical
100% agree, but even if one extremely large token gets added to the stream (and 
possibly dropped prior to indexing), the char[] grows without ever shrinking 
back (so it can result in memory usage growing if bad content is thrown in 
(and people have no shortage of bad content)

bq. Is a priority of major justified?

major is just the default priority (feel free to change)

bq. I assume that, based on this report, TermAttributeImpl never gets reset or 
discarded/recreated over the course of an indexing session?
using reusable TokenStream will never cause the buffer to be nulled (as far as 
i can tell) for the lifetime of the thread (please correct me if i'm wrong on 
this)


i would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing 
this to be statically updated), just as a means to bound the max memory used 
here
currently, the memory use is bounded by Integer.MAX_VALUE (which is really big)
If someone feeds a large text document with no spaces or other delimiting 
characters, a non-intelligent tokenizer would treat this a 1 big token (and 
grow the char[] accordingly)

  was (Author: tsmith):
b1. The worst-case scenario seems kind of theoretical
100% agree, but even if one extremely large token gets added to the stream (and 
possibly dropped prior to indexing), the char[] grows without ever shrinking 
back (so it can result in memory usage growing if bad content is thrown in 
(and people have no shortage of bad content)

bq. Is a priority of major justified?

major is just the default priority (feel free to change)

bq. I assume that, based on this report, TermAttributeImpl never gets reset or 
discarded/recreated over the course of an indexing session?
using reusable TokenStream will never cause the buffer to be nulled (as far as 
i can tell) for the lifetime of the thread (please correct me if i'm wrong on 
this)


i would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing 
this to be statically updated), just as a means to bound the max memory used 
here
currently, the memory use is bounded by Integer.MAX_VALUE (which is really big)
If someone feeds a large text document with no spaces or other delimiting 
characters, a non-intelligent tokenizer would treat this a 1 big token (and 
grow the char[] accordingly)
  
 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748082#action_12748082
 ] 

Tim Smith commented on LUCENE-1859:
---

bq. which non-intelligent tokenizers are you referring to? nearly all the 
lucene tokenizers have 255 as a limit.

perhaps this is a non-issue with regards to lucene tokenizers
however, Tokenizers can be implemented by anyone (not sure if there are 
adequate warnings about keeping tokens short)
it also may not be possible to keep tokens short, i may need to index a rather 
long id string in a TokenStream fashion which will grow the buffer without 
reclaiming this

perhaps it should be the responsibility of the Tokenizer to shrink the 
TermBuffer if it adds long tokens (but this will probably require some helper 
methods)

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748077#action_12748077
 ] 

Tim Smith commented on LUCENE-1859:
---

bq. I would set this to minor and would not take care before 2.9.

i would agree with this

just reported the issue as it has the potential to cause memory issues (and 
would think something should be done about it (in the long term at least))
also, the AttributeSource stuff does result in TermAttributeImpl being held 
onto pretty much forever if using a reusableTokenStream (correct?)
wasn't a new Token() created by the indexer for each doc/field in 2.4? if so, the 
unbounded growth would only last at most for the duration of indexing that one 
document?
with Attribute caching in the TokenStream, the oversized buffer now lasts for the duration of 
the TokenStream (or its underlying AttributeSource), which could remain 
until shutdown

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748091#action_12748091
 ] 

Tim Smith commented on LUCENE-1859:
---

i fail to see the complexity of adding one method to TermAttribute:
{code}
public void shrinkBuffer(int maxSize) {
  if ((maxSize > termLength) && (termBuffer.length > maxSize)) {
    // preserve the current term contents when reallocating a smaller buffer
    char[] newBuffer = new char[maxSize];
    System.arraycopy(termBuffer, 0, newBuffer, 0, termLength);
    termBuffer = newBuffer;
  }
}
{code}

Not having this is fine as long as it's well documented that emitting large 
tokens can and will result in memory growing uncontrollably (especially if using 
many indexing threads)

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748103#action_12748103
 ] 

Tim Smith commented on LUCENE-1859:
---

bq. Death by a thousand cuts. This is one cut.

by this logic, nothing new can ever be added. 
The thing that brought this to my attention was the new TokenStream API (one 
cut (rather big, but i like the new API so i'm happy with the blood loss (makes 
me dizzy and happy)))
The new TokenStream API holds onto these char[] much longer (if not forever), 
so this results in memory growing unbounded unless there is some facility to 
truncate/null out the char[]

bq. I wouldn't even add the note to the documentation.

I don't believe there is ever any valid argument against adding documentation.
If someone can shoot themselves in the foot with the gun you gave them, at 
least tell them not to point the gun at their foot with the safety off.

bq. The only reason to do this is to keep average memory usage down for the 
hell of it.
keeping average memory usage down prevents those wonderful OutOfMemory 
Exceptions (which are difficult at best to recover from)

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-08-26 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748122#action_12748122
 ] 

Tim Smith commented on LUCENE-1859:
---

On documentation:
any warnings/precautions should always be called out (calling out the external 
link (wiki/etc) for in depth details)
in depth descriptions of the details can be pushed off to wiki pages or 
external references, as long as a link is provided for the curious, but i would 
still argue that they should exist

bq. this doesn't prevent the OOM, it just makes it less likely

all you can ever do for OOM issues is make them less likely (short of just 
fixing a bug that holds onto memory like mad). 
If accepting arbitrary content, there will always be a possibility of the 
content forcing OOM issues. In general, everything possible should be done to 
reduce the likelihood of such OOM issues (IMO).

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory
 Obviously, it can be argued that Tokenizer's should never emit large 
 tokens, however it seems that the TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set
 I don't think i have actually encountered issues with this yet, however it 
 seems like if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst case scenario)
 perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-25 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747441#action_12747441
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. If we were to provide a default in Collector, it should be a simple 
constant, not a variable.

in that case, it may be useful to have this method return false by default 
(expecting docs in order, as this is the default in 2.4)


 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-25 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747587#action_12747587
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. I would prefer not to make a default here, ie, force an explicit choice, 
because it is an expert API. 

very reasonable

bq. BooleanQuery gets sizable gains in performance if you let it return docs 
out of order.

Any stats on the performance gains here available?
didn't see any on a cursory glance through javadoc

Also, are the implications of out of order docids coming back from nextDoc() 
well documented (javadoc?, wiki?)?
I guess out of order docids really screw up advance(int), so you should never 
call advance(int) if you allowed out of order collection for a Scorer?


 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-25 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747645#action_12747645
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. Out-of-order scoring is only used for top-scorers today in Lucene

I see that FilteredQuery passes scoreDocsInOrder down to its sub query
Is this incorrect?
seems like this could cause problems as FilteredQuery does call nextDoc/advance 
on its sub query (which could be out of order because of this)

 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746809#action_12746809
 ] 

Tim Smith commented on LUCENE-1821:
---

bq. Actually sorting (during collection) already gives you the docBase so 
shouldn't your app already have the context needed for this?

Yes, i get the docbase and all during collection, so doing sorting with a top 
level cache will be no problem.
I was mainly using sorting as an example of some of the pain caused by 
per-segment searching/caches (the Collector API makes it easy enough to do 
sorting
on the top level or per segment, so i'm not concerned about integration here)

For my app, i plan to allow sorting to be either per-segment or top-level 
in order to allow people to choose their poison: faster commit/less memory vs 
faster sorting
I also plan to do faceting likewise
certain features will always require a top-level cache (but those are advanced 
features anyway and should be expected to have impacts on commit time/first 
search time)

bq. Hmm... is advance in fact costly for your DocIdSets?

Think how costly it would be to do advance for the SortedVInt DocIdSet (a linear
search over compressed values).
For a bitset this is instantaneous, but to conserve memory it's better to use
a sorted int[] (or the SortedVInt stuff 2.9 provides).

In the end, I plan to bucketize the collected docs per segment, so this
should hopefully be less of an issue.
The nice thing about that approach is that I can have a bitset for one segment
(lots of matches in this segment) and a very small int[] for a different
segment, based on the matches per segment. The biggest difficulty is doing the
mapping to the per-segment DocIdSet (which will probably have to be slower).

bq. this one method would allow you to not have to subclass IndexSearcher.

I already have to subclass IndexSearcher (I do a lot of extra stuff);
however, IndexSearcher doesn't provide any protected access to its sub-readers
and doc starts, so I have to gather these myself in my subclass's
constructor (in the same way IndexSearcher is doing this).

I would really like to see getIndexReaderBase() added to 2.9's IndexSearcher.
I would also like to see the sub-readers and doc starts either made protected or
given protected accessor methods, so I don't have to recreate the same set of
sub-readers (and make sure I do this the same way for future versions of Lucene);
a rough sketch of such a caching subclass follows at the end of this comment.
Would also be nice to see a protected constructor on IndexSearcher like so:
{code}
  protected IndexSearcher(IndexReader reader, IndexReader[] subReaders, int[] docStarts) {
    ...
  }
{code}

This would allow creating temporary IndexSearchers much faster (no need to
gather sub-readers).
This would allow:
* easily creating an IndexSearcher that is top-level (subReaders[] would be
length 1 and just contain the reader)
* creating a temporary IndexSearcher off another IndexSearcher that contains
some short-lived context (I have this use case)
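
As mentioned above, a rough sketch of the caching variant hinted at in the workaround quoted below (the class name, the map, and the one-argument gatherSubReaders() call are illustrative only and mirror the workaround code; adjust to whatever your IndexSearcher subclass actually exposes):

{code}
// Hypothetical subclass (not a Lucene API): cache the per-segment doc bases once
// at construction time so getIndexReaderBase() becomes a simple map lookup
// instead of a linear scan per call.
public class MyIndexSearcher extends IndexSearcher {
  private final Map readerBases = new IdentityHashMap(); // IndexReader -> Integer docBase

  public MyIndexSearcher(IndexReader reader) {
    super(reader);
    List readers = new ArrayList();
    gatherSubReaders(readers); // same call as in the workaround below
    int maxDoc = 0;
    for (Iterator iter = readers.iterator(); iter.hasNext();) {
      IndexReader r = (IndexReader) iter.next();
      readerBases.put(r, new Integer(maxDoc));
      maxDoc += r.maxDoc();
    }
  }

  public int getIndexReaderBase(IndexReader reader) {
    Integer base = (Integer) readerBases.get(reader);
    return base == null ? -1 : base.intValue();
  }
}
{code}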




 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result
 // of gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
 

[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746839#action_12746839
 ] 

Tim Smith commented on LUCENE-1821:
---

bq. Have you done any benching here? I think we actually found that even most 
sorting cases were faster than in 2.4.1.

I haven't done any benchmarking.
I'm not arguing that 2.9 string sorting is slower than 2.4 string sorting; it
may well be faster for every case.
Per-segment searching and other improvements potentially added more gains in
performance than the new string sorting added losses in performance.

But I can say rather confidently that a large index with a bunch of segments
will result in string sorting being slower when using a per-segment string sort
cache instead of a full-index sort cache (think worst case using a \*:\* query).

bq. loading a field cache off a multi-segment index was dog slow
This is a trade-off:
slower cache loading in order to get faster sorting.
I plan to provide the ability to do both, and allow specific use cases to
decide what is best for them.

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result
 // of gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746842#action_12746842
 ] 

Tim Smith commented on LUCENE-1821:
---

I allow caches to be loaded at commit time (if configured), and recommend that
frequently used caches be configured to be loaded at this time.
This can result in slower commit times, but responsive queries as soon as the
commit is finished.

Once I also add the option of per-segment caching for sorting and faceting
(I'll probably put this on by default for sorting, for faceting maybe not), this
will allow full tunability for the end user.

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result
 // of gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)
Add OutOfOrderCollector and InOrderCollector subclasses of Collector


 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


I find myself always having to implement these methods, and I always return a
constant (depending on whether the collector can handle out-of-order hits).

It would be nice for these two convenience abstract classes to exist that
implemented acceptsDocsOutOfOrder() as final and returned the appropriate value.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746907#action_12746907
 ] 

Tim Smith commented on LUCENE-1849:
---

They would be convenience classes for people implementing their own Collectors
(as I am).

It's just kind of a pain (and it bloats the amount of required code by about 5 lines)
to have to always implement this method (when it could be inherited easily from a
parent class).

Just throwing this out as an idea to see if anyone else likes it (that's why I
marked it as a _Wish_).

 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746920#action_12746920
 ] 

Tim Smith commented on LUCENE-1849:
---

People tend to always reformat single-line functions like that to use at least 2
more lines (I think checkstyle/eclipse formatting will often screw up my
compact code if someone else ever touches it).
Also, you need the extra line for javadoc, so that's always 5 lines :(

I can always add these two classes to my class hierarchy (and I probably will if
it doesn't get added to Lucene's search package),
but I think these are in general useful to anyone implementing collectors.

A typical person porting to 2.9 can switch their HitCollector to subclass
InOrderCollector instead (in order to keep getting docs in order like Lucene
2.4).
This then means they don't need to even think about acceptsDocsOutOfOrder()
semantics unless they really want to.
Also one less method to implement incorrectly for us application developers :)
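
For reference, a minimal sketch of what the two proposed convenience classes could look like (these classes do not exist in Lucene; the names are just the ones proposed in this issue, the remaining Collector methods stay abstract, and each class would live in its own file):

{code}
// A collector that must receive doc ids in order.
public abstract class InOrderCollector extends Collector {
  public final boolean acceptsDocsOutOfOrder() {
    return false;
  }
}

// A collector that can handle doc ids arriving out of order.
public abstract class OutOfOrderCollector extends Collector {
  public final boolean acceptsDocsOutOfOrder() {
    return true;
  }
}
{code}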

 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746928#action_12746928
 ] 

Tim Smith commented on LUCENE-1849:
---

I like the idea of this flag being private final and initialized via a
Collector constructor.

Collector.acceptsDocsOutOfOrder() should then be made final, though? (Otherwise
each collector has a boolean flag that may never be used, if a subclass
implements acceptsDocsOutOfOrder() in its own way.)

 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747021#action_12747021
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. Or just make it package private? This flag is only used by oal.search.* to 
mate the right scorer to the collector.

Protected instead, please;
Collector subclasses should be able to inspect this value if they want/need to.



 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747028#action_12747028
 ] 

Tim Smith commented on LUCENE-1849:
---

will do

 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747039#action_12747039
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. I think this will get pretty messy and complicated.

Yeah, this is a bit messy with the chain of inheritance in these classes (as
each variant is slightly optimized depending on in-order/out-of-order
collection).

Makes me go back to favoring the InOrderCollector/OutOfOrderCollector abstract
classes, or maybe just one AbstractCollector class which implements all methods
except collect()

like so:
{code}
public abstract class AbstractCollector extends Collector {
  private final boolean allowDocsOutOfOrder;
  protected IndexReader reader;
  protected Scorer scorer;
  protected int docBase;

  public AbstractCollector() {
    this(false);
  }

  public AbstractCollector(boolean allowDocsOutOfOrder) {
    this.allowDocsOutOfOrder = allowDocsOutOfOrder;
  }

  public void setNextReader(IndexReader reader, int docBase) {
    this.reader = reader;
    this.docBase = docBase;
  }

  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  public final boolean acceptsDocsOutOfOrder() {
    return allowDocsOutOfOrder;
  }
}
{code}

bq. What exactly are we trying to solve here?
The Collector methodology has grown more complicated (because it does more, to
handle per-segment searches);
the HitCollector API was nice and simple.

This AbstractCollector (insert better name here) gets things back to being more
simple.
It could even hide the Scorer as private and provide a score() method that
returns the score for the current document, and otherwise simplify this even
more.

Subclassing AbstractCollector instead of Collector makes it so most of the
required common things are done for you;
otherwise, every single Collector will do virtually the same thing as is done in
AbstractCollector here (as far as setup/etc.).

Again, this is just a _Wish_ which I've thought of as I've been working through
the new collector API (and found myself doing the exact same thing for every
implementation of Collector).


 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747046#action_12747046
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. we force them to think a little bit and then do what's best for them

The more you force people to think, the more likely they will come to the wrong
solution (in my experience).

I love the power of the new Collector API, and I know how to take advantage of
it to eke out the utmost performance where it matters or is possible. But in
some cases, I just want that AbstractCollector, because it reduces my code
complexity for subclasses and does everything I need without me introducing
duplicated code.

Also, the AbstractCollector makes it much easier to create anonymous subclasses
of Collector (just one method to override). (I hate anonymous subclasses myself,
but I see them used a lot inside Lucene.) I know in 2.4 there were tons of
anonymous HitCollectors.
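
To illustrate, a usage sketch assuming the AbstractCollector from the earlier comment (searcher and query are assumed to exist already; this is not an existing Lucene helper): an anonymous, in-order collector that just gathers absolute doc ids.

{code}
final List docs = new ArrayList();
searcher.search(query, new AbstractCollector() {
  public void collect(int doc) {
    // docBase comes from AbstractCollector.setNextReader(), so this
    // rebases the segment-relative doc id to the top-level doc id
    docs.add(new Integer(docBase + doc));
  }
});
{code}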

 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747051#action_12747051
 ] 

Tim Smith commented on LUCENE-1849:
---

I was just proposing AbstractCollector to consolidate the variations of
abstract subclasses of Collector.

I like ScoringCollector;
I would also like a NonScoringCollector.

In this case, I would recommend both should take the allowDocsOutOfOrder flag
in their constructors (and store it in a private final field returned by
acceptsDocsOutOfOrder());
otherwise, I would still want to see 2 variations on each of ScoringCollector
and NonScoringCollector to handle the OutOfOrder vs. InOrder variations.




 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747059#action_12747059
 ] 

Tim Smith commented on LUCENE-1849:
---

I guess the question is: what variations do we provide helper Collector 
implementations for?

Seems like there's a bunch of possibilities (depending on how far you go).

That's why I initially proposed AbstractCollector (storing everything that was
set (IndexReader, Scorer, docBase)).
The amount of memory and time used to set two pointers and an int per segment
almost seems irrelevant for this Collector implementation aid (and if you
really care about those few bytes and CPU cycles, you can directly implement
Collector).


 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector

2009-08-24 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747075#action_12747075
 ] 

Tim Smith commented on LUCENE-1849:
---

bq. I think we should simply do nothing. This is an expert API.

I'm OK with that.

I just thought this idea would potentially be of general use for other
developers, but it probably gets more complex adding all the variations of
Collector subclasses, and maybe even more confusing than just the raw
Collector API.


 Add OutOfOrderCollector and InOrderCollector subclasses of Collector
 

 Key: LUCENE-1849
 URL: https://issues.apache.org/jira/browse/LUCENE-1849
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 I find myself always having to implement these methods, and i always return a 
 constant (depending on if the collector can handle out of order hits)
 would be nice for these two convenience abstract classes to exist that 
 implemented acceptsDocsOutOfOrder() as final and returned the appropriate 
 value

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-23 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746600#action_12746600
 ] 

Tim Smith commented on LUCENE-1821:
---

Well, you could go a route similar to the 2.4 TokenStream API (next() vs.
next(Token)):

have Filter.getDocIdSet(IndexSearcher, IndexReader) call
Filter.getDocIdSet(IndexReader), and vice versa, by default;
one method or the other would be required to be overridden.

getDocIdSet(IndexReader) would be deprecated (and removed in 3.0).

Since the deprecated method would be removed in 3.0, and since no one would
probably be depending on these new semantics right away, this should work.

Also, in general, QueryWrapperFilter performs a bit worse now in 2.9.
This is because it creates an IndexSearcher for every query it wraps (which
results in doing gatherSubReaders and creating the offsets anew each time
getDocIdSet(IndexReader) is called),
so the new method with the IndexSearcher also passed in is much better for
evaluating these Filters.
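
A rough sketch of that delegation pattern, purely hypothetical since this two-method Filter API is only the proposal above, not an existing Lucene API (the mutual defaults mean a subclass must override at least one of the two methods, otherwise the calls would recurse forever):

{code}
public abstract class Filter implements Serializable {
  /** Old method: deprecated, delegates to the new one by default. */
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    // building a throwaway IndexSearcher here is exactly the cost
    // QueryWrapperFilter pays today
    return getDocIdSet(new IndexSearcher(reader), reader);
  }

  /** New method: delegates to the old one by default. */
  public DocIdSet getDocIdSet(IndexSearcher searcher, IndexReader reader) throws IOException {
    return getDocIdSet(reader);
  }
}
{code}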


 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result
 // of gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-23 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746613#action_12746613
 ] 

Tim Smith commented on LUCENE-1821:
---

bq. thats a tough bunch of code to decide to spread ...
at least it'll be able to go away real soon with 3.0

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result
 // of gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-23 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746643#action_12746643
 ] 

Tim Smith commented on LUCENE-1821:
---

Lots of new comments to respond to :)
I'll try to cover them all.

bq. decent comparator (StringOrdValComparator) that operates per segment.

Still, the StringOrdValComparator will have to break down and call
String.equals() whenever it compares docs in different IndexReaders.
It also has to do more maintenance in general than would be needed for a plain
string-ord comparator that had a cache across all IndexReaders.
While the StringOrdValComparator may be faster in 2.9 than string sorting in
2.4, it's not as fast as it could be if the cache were created at the
IndexSearcher level.
I looked at the new string sorting stuff last week, and it looks pretty smart
about reducing the number of String.equals() calls needed, but this adds extra
complexity and will still be reduced to String.equals() calls, which will
translate to slower sorting than could be possible.

bq. one option might be to subclass DirectoryReader 

The idea of this is to disable per-segment searching?
I don't actually want to do that. I want to use the per-segment searching
functionality to take advantage of caches on a per-segment basis where possible,
and map docs to the IndexSearcher context when I can't do per-segment caching.

bq. Could you compute the top-level ords, but then break it up per-segment?

I think I see what you're getting at here, and I've already thought of this as a
potential solution. The cache will always need to be created at the top-most
level, but it will be pre-broken out into a per-segment cache whose context is
the top-level IndexSearcher/MultiReader. The biggest problem here is the
complexity of actually creating such a cache, which I'm sure will translate to
this cache loading more slowly (hard to say how much slower without implementing it).
I do plan to try this approach, but I expect this will be at least a week or
two out from now.
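
A rough sketch of that "top-level ords, broken out per segment" idea (hypothetical helper, not a Lucene API; the ord array is assumed to already be built against the top-level reader):

{code}
// Returns a map from each sub-reader to an int[] of ords indexed by the
// segment-local doc id, sliced out of the top-level ord array.
public static Map splitOrdsPerSegment(int[] topLevelOrds, IndexReader[] subReaders) {
  Map perSegment = new HashMap(); // IndexReader -> int[]
  int docBase = 0;
  for (int i = 0; i < subReaders.length; i++) {
    int maxDoc = subReaders[i].maxDoc();
    int[] slice = new int[maxDoc];
    System.arraycopy(topLevelOrds, docBase, slice, 0, maxDoc);
    perSegment.put(subReaders[i], slice);
    docBase += maxDoc;
  }
  return perSegment;
}
{code}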

I've currently updated my code for this to work per-segment by adding the
docBase when performing the lookup into this cache (which is per-IndexSearcher).
I did this using the getIndexReaderBase() function I added to my subclass of
IndexSearcher, at Scorer construction time. (I can live with this; however, I
would like to see getIndexReaderBase() added to IndexSearcher, and the
IndexSearcher passed to Weight.scorer(), so I don't need to hold onto my
IndexSearcher subclass in my Weight implementation.)

bq. just return the virtual per-segment DocIdSet.

That's what I'm doing now. I use the doc id base for the IndexReader, along with
its maxDoc, to have the Scorer represent a virtual slice for just the segment in
question.
The only real problem here is that during Scorer initialization for this I have
to call fullDocIdSetIter.advance(docBase) in the Scorer constructor. If
advance(int) for the DocIdSet in question is O(N), this adds an extra penalty
per segment that did not exist before.
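
For illustration, a rough sketch of such a "virtual slice" (hypothetical class, all names made up; written against the 2.9-style docID()/nextDoc()/advance() iterator API):

{code}
// Presents a top-level DocIdSetIterator as a per-segment iterator by advancing
// to docBase up front and rebasing the returned doc ids to be segment-relative.
public class SegmentSliceIterator extends DocIdSetIterator {
  private final DocIdSetIterator full; // iterator over the whole index
  private final int docBase;
  private final int segEnd;            // docBase + maxDoc of this segment
  private int doc = -1;
  private boolean first = true;

  public SegmentSliceIterator(DocIdSetIterator full, int docBase, int maxDoc) throws IOException {
    this.full = full;
    this.docBase = docBase;
    this.segEnd = docBase + maxDoc;
    full.advance(docBase); // the up-front penalty discussed above
  }

  public int docID() {
    return doc;
  }

  public int nextDoc() throws IOException {
    int d = first ? full.docID() : full.nextDoc();
    first = false;
    return rebase(d);
  }

  public int advance(int target) throws IOException {
    int abs = target + docBase;
    int d = (first && full.docID() >= abs) ? full.docID() : full.advance(abs);
    first = false;
    return rebase(d);
  }

  private int rebase(int d) {
    doc = (d == NO_MORE_DOCS || d >= segEnd) ? NO_MORE_DOCS : d - docBase;
    return doc;
  }
}
{code}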

bq. This isn't a long-term solution, since the order in which Lucene visits the
readers isn't in general guaranteed,

That's where IndexSearcher.getIndexReaderBase(IndexReader) comes into play. If
you call this in your scorer to get the docBase, it doesn't matter what order
the segments are searched in (it'll always return the proper base, in the
context of that IndexSearcher).


Here's another potential thought (very rough; I haven't consulted the code to see how
feasible this is):
what if Similarity had a method called getDocIdBase(IndexReader)?
Then the searcher implementation could wrap the provided Similarity to provide
the proper calculation.
Similarity is already passed through this chain of Weight creation and
is passed into the Scorer.
Obviously, a Query implementation can completely drop the passing of the
Searcher's similarity and drop in its own (but this would mean it doesn't care
about getting these doc id bases).
I think this approach would potentially resolve all MultiSearcher difficulties.
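
A very rough sketch of that idea (purely hypothetical: getDocIdBase() is not a real Similarity method, and MyIndexSearcher stands in for the IndexSearcher subclass from the earlier comments):

{code}
// The searcher would wrap whatever Similarity the query provides, so a Scorer
// can ask its Similarity for the doc id base of the reader it is scoring.
public class BaseAwareSimilarity extends SimilarityDelegator {
  private final MyIndexSearcher searcher;

  public BaseAwareSimilarity(Similarity delegate, MyIndexSearcher searcher) {
    super(delegate);
    this.searcher = searcher;
  }

  public int getDocIdBase(IndexReader reader) {
    return searcher.getIndexReaderBase(reader);
  }
}
{code}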







 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 

[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-23 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746662#action_12746662
 ] 

Tim Smith commented on LUCENE-1821:
---

Can I at least argue for it being tagged for 3.0 or 3.1 (just so it gets looked
at again prior to the next releases)?

I have workarounds for 2.9, so I'm OK with it not getting in then (I just want to
make sure my use cases won't be made impossible in future releases).

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the document's it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having Weight.scorer() method also take a integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result
 // of gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource

2009-08-22 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746450#action_12746450
 ] 

Tim Smith commented on LUCENE-1842:
---

Here's some pseudo code to hopefully fully show this use case:

{code}
// These guys are initialized once
Analyzer analyzer1 = new SimpleAnalyzer();
Analyzer analyzer2 = new StandardAnalyzer();
Analyzer analyzer3 = new LowerCaseAnalyzer();

// This is done on a per-Field basis ("content" is just an example field name)
Reader source1 = new StringReader("some text");
Reader source2 = new StringReader("some more text");
Reader source3 = new StringReader("final text");

TokenStream stream1 = analyzer1.reusableTokenStream("content", source1);
TokenStream stream2 = analyzer2.reusableTokenStream("content", source2);
TokenStream stream3 = analyzer3.reusableTokenStream("content", source3);

// Create the container for the shared attributes map
AttributeSource attrs = new AttributeSource();

// Have all streams share the same attributes map (this is the proposed method)
stream1.reset(attrs);
stream2.reset(attrs);
stream3.reset(attrs);

// Create my merging TokenStream (have it use attrs as its attribute source)
TokenStream merger = new MergeTokenStreams(attrs,
    new TokenStream[] { stream1, stream2, stream3 });

// Add a filter that will put a token prior to the source token stream, and
// after the source token stream is exhausted
TokenStream finalStream = new WrapFilter(merger, "anchor token");

// finalStream will now be passed to the indexer
{code}

Hopefully this makes this use case more clear.
In order to use reusableTokenStream from the Analyzers, the MergeTokenStreams
must be able to share its attributes map with the underlying TokenStreams it's
merging;
otherwise, MergeTokenStreams has to do something like this in its
incrementToken():
{code}
public boolean incrementToken() throws IOException {
  if (currentStream.incrementToken()) {
    // copy currentStream's term attribute into my local term attribute
    // copy currentStream's offset attribute into my local offset attribute
    return true;
  } else {
    // advance currentStream to be the next stream in line (and try again)
  }
}
{code}

as opposed to:
{code}
public boolean incrementToken() throws IOException {
  if (currentStream.incrementToken()) {
    // don't need to do anything (the underlying token streams share the
    // same attributes map as me)
    return true;
  } else {
    // advance currentStream to be the next stream in line (and try again)
  }
}
{code}

Hopefully this makes my use case clear
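
To make the shared-attributes variant concrete, here is a minimal sketch of the merging stream assumed above (MergeTokenStreams is a hypothetical class from this use case; it relies on sharing one attribute map across the sub-streams, e.g. via the proposed reset(AttributeSource)):

{code}
public final class MergeTokenStreams extends TokenStream {
  private final TokenStream[] streams;
  private int current = 0;

  public MergeTokenStreams(AttributeSource attrs, TokenStream[] streams) {
    super(attrs); // use the shared attribute source
    this.streams = streams;
  }

  public boolean incrementToken() throws IOException {
    while (current < streams.length) {
      if (streams[current].incrementToken()) {
        // nothing to copy: the sub-stream wrote into the shared attribute map
        return true;
      }
      current++; // straight concatenation: move on to the next stream
    }
    return false;
  }
}
{code}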

 Add reset(AttributeSource) method to AttributeSource
 

 Key: LUCENE-1842
 URL: https://issues.apache.org/jira/browse/LUCENE-1842
 Project: Lucene - Java
  Issue Type: Wish
  Components: Analysis
Reporter: Tim Smith
Priority: Minor

 Originally proposed in LUCENE-1826
 Proposing the addition of the following method to AttributeSource
 {code}
 public void reset(AttributeSource input) {
   if (input == null) {
     throw new IllegalArgumentException("input AttributeSource must not be null");
   }
   this.attributes = input.attributes;
   this.attributeImpls = input.attributeImpls;
   this.factory = input.factory;
 }
 {code}
 Impacts:
 * requires all TokenStreams/TokenFilters/etc to call addAttribute() in their 
 reset() method, not in their constructor
 * requires making AttributeSource.attributes and 
 AttributeSource.attributesImpl non-final
 Advantages:
 Allows creating only a single actual AttributeSource per thread that can then 
 be used for indexing with a multitude of TokenStream/Tokenizer combinations 
 (allowing utmost reuse of TokenStream/Tokenizer instances)
 this results in only a single attributes/attributesImpl map being 
 required per thread
 addAttribute() calls will almost always return right away (will only be 
 initialized once per thread)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource

2009-08-22 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746452#action_12746452
 ] 

Tim Smith commented on LUCENE-1842:
---

Yes, I know that creating the Tokenizer/TokenStream fully each time will do the
trick as well, but I was hoping for some way to take advantage of the
reusableTokenStream concepts (especially in the case of Tokenizers that take a
long time to construct (load resources/etc.)).

What I guess I really want is this method added to Analyzer:
{code}
public TokenStream tokenStream(AttributeSource attrs, Reader reader);
{code}

but I assume this would either have to reconstruct the full TokenStream chain
every time (could be costly), or it would require the
AttributeSource.reset(AttributeSource) method in order to reuse saved streams.




 Add reset(AttributeSource) method to AttributeSource
 

 Key: LUCENE-1842
 URL: https://issues.apache.org/jira/browse/LUCENE-1842
 Project: Lucene - Java
  Issue Type: Wish
  Components: Analysis
Reporter: Tim Smith
Priority: Minor

 Originally proposed in LUCENE-1826
 Proposing the addition of the following method to AttributeSource
 {code}
 public void reset(AttributeSource input) {
   if (input == null) {
     throw new IllegalArgumentException("input AttributeSource must not be null");
   }
   this.attributes = input.attributes;
   this.attributeImpls = input.attributeImpls;
   this.factory = input.factory;
 }
 {code}
 Impacts:
 * requires all TokenStreams/TokenFilters/etc to call addAttribute() in their 
 reset() method, not in their constructor
 * requires making AttributeSource.attributes and 
 AttributeSource.attributesImpl non-final
 Advantages:
 Allows creating only a single actual AttributeSource per thread that can then 
 be used for indexing with a multitude of TokenStream/Tokenizer combinations 
 (allowing utmost reuse of TokenStream/Tokenizer instances)
 this results in only a single attributes/attributesImpl map being 
 required per thread
 addAttribute() calls will almost always return right away (will only be 
 initialized once per thread)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource

2009-08-22 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746455#action_12746455
 ] 

Tim Smith commented on LUCENE-1842:
---

The problem with the MergeAnalyzer is that it requires multiple Readers as
input, but I think the idea does put me on another (potentially better) track for
handling sharing the same underlying AttributeSource for all the merged
token streams (as well as sharing reusable TokenStreams).

I'll try to put this to the test on Monday when I get back to work.

 Add reset(AttributeSource) method to AttributeSource
 

 Key: LUCENE-1842
 URL: https://issues.apache.org/jira/browse/LUCENE-1842
 Project: Lucene - Java
  Issue Type: Wish
  Components: Analysis
Reporter: Tim Smith
Priority: Minor

 Originally proposed in LUCENE-1826
 Proposing the addition of the following method to AttributeSource
 {code}
 public void reset(AttributeSource input) {
   if (input == null) {
     throw new IllegalArgumentException("input AttributeSource must not be null");
   }
   this.attributes = input.attributes;
   this.attributeImpls = input.attributeImpls;
   this.factory = input.factory;
 }
 {code}
 Impacts:
 * requires all TokenStreams/TokenFilters/etc to call addAttribute() in their 
 reset() method, not in their constructor
 * requires making AttributeSource.attributes and 
 AttributeSource.attributesImpl non-final
 Advantages:
 Allows creating only a single actual AttributeSource per thread that can then 
 be used for indexing with a multitude of TokenStream/Tokenizer combinations 
 (allowing utmost reuse of TokenStream/Tokenizer instances)
 this results in only a single attributes/attributesImpl map being 
 required per thread
 addAttribute() calls will almost always return right away (will only be 
 initialized once per thread)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource

2009-08-22 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746457#action_12746457
 ] 

Tim Smith commented on LUCENE-1842:
---

I would never use the merging TokenStream when doing highlighting anyway;
also, I'm sure I can get the merging TokenStream to update the offsets to be
appropriate (based on the merge) -- I never use offsets for anything right now
anyway (although I may in the future).

And I can't let the indexer do the merging, because I want to add additional
analytics on top of the merge (which can't be done on the sub-streams in
piecemeal fashion).

Also, merging may not be a straight concatenation; more complex merges may merge
sorted streams into a final sorted token stream, interleave tokens from sub-streams in
round-robin fashion, and so on (the only use I have for it right now is the
straight concatenation, however this concept could be applied to more nasty stuff).



 Add reset(AttributeSource) method to AttributeSource
 

 Key: LUCENE-1842
 URL: https://issues.apache.org/jira/browse/LUCENE-1842
 Project: Lucene - Java
  Issue Type: Wish
  Components: Analysis
Reporter: Tim Smith
Priority: Minor

 Originally proposed in LUCENE-1826
 Proposing the addition of the following method to AttributeSource
 {code}
 public void reset(AttributeSource input) {
   if (input == null) {
     throw new IllegalArgumentException("input AttributeSource must not be null");
   }
   this.attributes = input.attributes;
   this.attributeImpls = input.attributeImpls;
   this.factory = input.factory;
 }
 {code}
 Impacts:
 * requires all TokenStreams/TokenFilters/etc to call addAttribute() in their 
 reset() method, not in their constructor
 * requires making AttributeSource.attributes and 
 AttributeSource.attributesImpl non-final
 Advantages:
 Allows creating only a single actual AttributeSource per thread that can then 
 be used for indexing with a multitude of TokenStream/Tokenizer combinations 
 (allowing utmost reuse of TokenStream/Tokenizer instances)
 this results in only a single attributes/attributesImpl map being 
 required per thread
 addAttribute() calls will almost always return right away (will only be 
 initialized once per thread)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745941#action_12745941
 ] 

Tim Smith commented on LUCENE-1821:
---

I'm OK with having to jump through some hoops in order to get back to the full 
index context

It would be nice if this was more facilitated by lucene's API (IMO, this would 
be best handled by adding a Searcher as the first arg to Weight.scorer(), as 
then a Weight will not need to hold on to this (breaking serializable))

There are definitely plenty of use cases that take advantage of the whole 
index (one created by IndexWriter), so this ability should not be removed
I have at least 3 in my application alone (and they are all very important)

You get tradeoffs working per-segment vs. per-MultiReader when it comes to
caching in general:
going per-segment means caches load faster, and load less frequently; however,
this causes algorithms working with the caches to be slower (depending on the
algorithm and cache type).

For static boosting from a field value (ValueSource), it makes no difference.
For numeric sorting, it makes no difference.

For string sorting, it makes a big difference - you now have to do a bunch of
String.equals() calls, where you didn't have to in 2.4 (you just used the ord index).
Given this, you should really be able to do string sorting in 2 ways:
* using a per-segment field cache (commit time/first query faster, sort time
slower)
* using a multi-reader field cache (commit time/first query slower, sort time
faster)

This same argument also goes for features like faceting (not provided by
Lucene, but provided by applications like Solr, and my application). Using a
per-segment cache will cause some significant performance loss when performing
faceting, as it requires creating the facets for each segment and then merging
them (this results in a good deal of extra object overhead/memory overhead/more
work, where faceting on the multi-reader does not see this).

In the end, it should be up to the application developer to choose what 
strategy works best for them, and their application (fast commits/fast cache 
loading may take a back seat to fast query execution)

In general, i find there is a tradeoff between commit time and query time. The 
more you speed up commit time, the slower query time gets, and vice versa.
I just want/need the ability to choose






 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745960#action_12745960
 ] 

Tim Smith commented on LUCENE-1821:
---

bq. You never officially had the full index context
Officially, i also wasn't told that i didn't have the full index context (it was 
undefined at best, but it was clear from both the lucene code and my use of the 
API that i did have the full index context)

Whenever i do a search, i always explicitly know what context i'm searching in 
(it's always an IndexSearcher context).
further, whenever i pass an IndexReader to any method (to create a cache/etc), 
i explicitly know what context i'm dealing with in order to know what the 
docids used mean.
as the application developer, i have full control over what i pass into the 
lucene API and where, and know the context of passing that in (javadoc should 
just be fully clear on how what goes in is used, if not already). i always have 
the option to not use a utility class/method provided by lucene if it does not 
have the proper context semantics i need (and can write my own that does)

bq. The current API would not support this without back compat breaks up the 
wazoo
i kinda see what you mean here, but then how is it ok to pass an IndexReader to 
this method by that same logic?
it seems like it should be ok to pass the IndexSearcher (the direct context for 
the IndexReader in question) to Weight.scorer() if it's ok to pass the 
IndexReader (the scorer() method's interface was already changed between 2.4 
and 2.9 (adding allowDocsInOrder and topScorer))

bq. You can pick, but we have to be true to the API or change it (not easy with 
our back compat policies)
to be fair, 2.9 has a lot of back compat breaks, both in API and runtime behavior 
(i had tons of compile errors when i dropped 2.9 in, as well as some other 
hacks i had to add in (at least temporarily) in order to get 2.9 to work due to 
run time changes (primarily this per segment search stuff))

I have no problem with back compat breaks in general (only took me about a day 
to absorb 2.9 initially (still working on fully taking advantage of new 
features and getting rid of deprecated class use)). The only requirement i would 
put on a back compat break is that it have a workaround to get back to the 
previous version's behavior (in this case, have it possible to remap the docids 
to the IndexSearcher context inside the scorer)



 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745969#action_12745969
 ] 

Tim Smith commented on LUCENE-1826:
---

bq. This is not possible per design. The AttributeSource cannot be changed.
I fully understand why

but...
it should be rather easy to add a reset(AttributeSource input) to 
AttributeSource
{code}
public void reset(AttributeSource input) {
  if (input == null) {
    throw new IllegalArgumentException("input AttributeSource must not be null");
  }
  this.attributes = input.attributes;
  this.attributeImpls = input.attributeImpls;
  this.factory = input.factory;
}
{code}

This would require making attributes and attributeImpls non-final (potentially 
reducing some jvm caching capabilities)

However, this then provides the ability to do even more Attribute reuse
For example, if this method existed, the Indexer could use a ThreadLocal of raw 
AttributeSources (one AttributeSource per thread)
then, prior to calling TokenStream.reset(), it could call 
TokenStream.reset(ThreadLocal AttributeSource)

This would result in all token streams for the same document using the same 
AttributeSource (reusing TermAttribute, etc)

This would require that no TokenStreams/Filters/Tokenizers call 
addAttribute() in the constructor (they would have to do this in reset())

I totally get that this is a tall order.
If you want, i can open a separate ticket for this 
(AttributeSource.reset(AttributeSource)) for further consideration
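
A minimal sketch of the reuse pattern being proposed, assuming the proposed 
AttributeSource.reset(AttributeSource) existed (it does not in the current API); 
the ThreadLocal and the way the stream is consumed are illustrative only:
{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

class SharedAttributeSourceSketch {
  // one raw AttributeSource per thread, reused across all token streams
  private static final ThreadLocal PER_THREAD = new ThreadLocal() {
    protected Object initialValue() { return new AttributeSource(); }
  };

  void consume(TokenStream stream) throws IOException {
    AttributeSource shared = (AttributeSource) PER_THREAD.get();
    stream.reset(shared);   // proposed method: adopt the shared attribute maps
    stream.reset();         // the usual per-stream reset
    while (stream.incrementToken()) {
      // all attributes now live in 'shared'; addAttribute() hits the same maps
      // and returns immediately after the first use on this thread
    }
  }
}
{code}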



 All Tokenizer implementations should have constructors that take 
 AttributeSource and AttributeFactory
 -

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Assignee: Michael Busch
 Fix For: 2.9


 I have a TokenStream implementation that joins together multiple sub 
 TokenStreams (i then do additional filtering on top of this, so i can't just 
 have the indexer do the merging)
 in 2.4, this worked fine.
 once one sub stream was exhausted, i just started using the next stream 
 however, in 2.9, this is very difficult, and requires copying Term buffers 
 for every token being aggregated
 however, if all the sub TokenStreams share the same AttributeSource, and my 
 concat TokenStream shares the same AttributeSource, this goes back to being 
 very simple (and very efficient)
 So for example, i would like to see the following constructor added to 
 StandardTokenizer:
 {code}
   public StandardTokenizer(AttributeSource source, Reader input, boolean 
 replaceInvalidAcronym) {
 super(source);
 ...
   }
 {code}
 would likewise want similar constructors added to all Tokenizer sub classes 
 provided by lucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745979#action_12745979
 ] 

Tim Smith commented on LUCENE-1821:
---

NOTE: if the leaf IndexSearcher were to be passed to scorer(), it would also 
have to be passed to explain()

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745977#action_12745977
 ] 

Tim Smith commented on LUCENE-1821:
---

{quote}
It was an implementation detail. If you look at MultiSearcher, Searchable, 
Searcher and how the API is put together, you can see we don't support that 
type of thing. I think its fairly clear after a little thought.

You can limit your API's to handle just IndexSearchers, but as a project, we 
cannot.
{quote}
I totally understand your resistance here.  I get that i'm really utilizing 
advanced lucene concepts at very low levels (and these are subject to some 
changes that i will have to absorb with new versions)

bq. Its okay to pass the Reader because its a contextless Reader. There is no 
value in also passing a contextless Searcher
well, when you pass the Searcher that contains the Reader, the Reader is no longer 
contextless.
also, the context of the Searcher can be fairly well defined (it's a leaf 
Searcher: the one that actually called Weight.scorer())

Also, looking a bit more at MultiSearcher semantics, sorting already requires this 
leaf Searcher context in order to work:
MultiSearcher just takes the top docs from each underlying Searchable, adjusts 
the docids to the MultiSearcher context, and sends them through another 
priority queue.
So, this leaf Searcher context concept is required by sorting already. 
I just want my Scorer to be given this leaf context as well

Also, since it is a leaf context, the Weight.scorer() method could have the 
following interface:
{code}
/**
 * @param searcher The IndexSearcher that contains reader.
 */
public Scorer scorer(IndexSearcher searcher, IndexReader reader,
                     boolean allowDocsInOrder, boolean topScorer);
{code}

then, with the patch i posted, i could call:
searcher.getIndexReaderBase(reader) 
and i'm all set



 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745988#action_12745988
 ] 

Tim Smith commented on LUCENE-1821:
---

here's what you can do:

{code}
  /** @deprecated use {@link #getDocIdSet(IndexSearcher, IndexReader)} */
  public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    return getDocIdSet(new IndexSearcher(reader), reader);
  }

  public DocIdSet getDocIdSet(final IndexSearcher searcher, final IndexReader reader)
      throws IOException {
    final Weight weight = query.weight(searcher);
    return new DocIdSet() {
      public DocIdSetIterator iterator() throws IOException {
        return weight.scorer(searcher, reader, true, false);
      }
    };
  }
{code}

and yeah, i'm all for tons of warnings in javadoc explicitly defining the contracts


 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745991#action_12745991
 ] 

Tim Smith commented on LUCENE-1821:
---

what class is this getDocIdSet method on? (lacking the context of where it's used)

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746004#action_12746004
 ] 

Tim Smith commented on LUCENE-1821:
---

Looks like Filter should have another method added: getDocIdSet(IndexSearcher 
searcher, IndexReader reader) (deprecating getDocIdSet(IndexReader))

The new method would call the old method by default (with little harm done in general).
IndexSearcher would call the new getDocIdSet() variant.
QueryWrapperFilter would be updated to implement getDocIdSet(IndexSearcher, 
IndexReader) (with the old method wrapping the IndexReader in an IndexSearcher).
This would actually be cleaner for QueryWrapperFilter, as it wouldn't have to 
create a new IndexSearcher on every call

i definitely see that this is potentially more painful than the changes to the 
scorer() method (question is how many people implement custom Filters?)

Personally, i don't use Filter, so any changes here don't impact me, but to the 
best of my knowledge, i'm not the only one using lucene :)
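
A minimal sketch of that proposed Filter change (the two-arg overload does not 
exist in the current API; the subclass here is only to show the shape of the 
default delegation):
{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;

// Proposed shape: the searcher-aware variant falls back to the reader-only
// method by default, so existing Filter subclasses keep working unchanged.
public abstract class SearcherAwareFilter extends Filter {
  public DocIdSet getDocIdSet(IndexSearcher searcher, IndexReader reader)
      throws IOException {
    return getDocIdSet(reader);
  }
}
{code}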

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1839) Scorer.explain is deprecated but abstract, should have impl that throws UnsupportedOperationException

2009-08-21 Thread Tim Smith (JIRA)
Scorer.explain is deprecated but abstract, should have impl that throws 
UnsupportedOperationException
-

 Key: LUCENE-1839
 URL: https://issues.apache.org/jira/browse/LUCENE-1839
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


Suggest having Scorer implement explain to throw UnsupportedOperationException

right now, i have to implement this method (because it's abstract), and javac 
yells at me for overriding a deprecated method

if the following implementation is in Scorer, i can remove my empty 
implementations of explain from my Scorers
{code}
  /** Returns an explanation of the score for a document.
   * <br>When this method is used, the {@link #next()}, {@link #skipTo(int)} and
   * {@link #score(HitCollector)} methods should not be used.
   * @param doc The document number for the explanation.
   *
   * @deprecated Please use {@link IndexSearcher#explain}
   * or {@link Weight#explain} instead.
   */
  public Explanation explain(int doc) throws IOException {
    throw new UnsupportedOperationException();
  }
{code}

best i figure, this shouldn't break back compat (people already have to 
recompile anyway) (2.9 definitely not binary compatible with 2.4)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746263#action_12746263
 ] 

Tim Smith commented on LUCENE-1821:
---

I started integrating the per-segment searching (removed my hack that was doing 
searching on MultiReader)

In order to get my query implementations to work, i had to hold onto my 
Searcher in the Weight constructor and add a getIndexReaderBase() method to my 
IndexSearcher implementation, and this seems to be working well

I had 3 query implementations that were affected:
one used a cache that will be easy to create per segment (will have this use a 
per segment cache as soon as i can)
one used an int[] ord index (the underlying cache cannot be made per segment)
one used a cached DocIdSet created over the top level MultiReader (should be 
able to have a DocIdSet per Segment reader here, but this will take some more 
thinking (source of the matching docids is from a separate index), will also 
need to know which sub docidset to use based on which IndexReader is passed to 
scorer() - shouldn't be any big deal)

i'm a bit concerned that i may not be testing multi-segment searching quite 
properly right now though since i think most of my indexes being tested only 
have one segment.
On that topic, if i create a subclass of LogByteSizeMergePolicy and return null 
from findMerges() and findMergesToExpungeDeletes(), will this guarantee that 
segments will only be merged if i explicitly optimize? In which case, i can 
just pepper in some commits as i add documents to guarantee that i have more 
than 1 segment.
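
For the "pepper in some commits" route, a minimal sketch against the 2.9-era 
IndexWriter API (the analyzer and commit interval here are arbitrary): each 
commit() flushes a new segment, and with the default mergeFactor of 10, fewer 
than ten segments will not be merged away, which should be enough to exercise 
multi-segment search in a test.
{code}
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

class MultiSegmentTestIndex {
  static void build(Directory dir, List docs) throws IOException {
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < docs.size(); i++) {
      writer.addDocument((Document) docs.get(i));
      if (i % 5 == 4) {
        writer.commit();   // flush a new segment every few documents
      }
    }
    writer.close();
  }
}
{code}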

Overall, i am really liking the per-segment stuff, and the Collector API in 
general.
it's already made it possible to optimize a good deal of things away (like 
calling Scorer.score() for docs that end up getting filtered away), however i 
hit some deoptimization due to some of the crazy stuff i had to do to make 
those 3 query implementations work, but this should only really be isolated to 
one of the implementations (and i can hopefully reoptimize those cases anyway)

I would still like to see IndexSearcher passed to Weight.scorer(), and the 
getIndexReaderBase() method added to IndexSearcher.
This will clean up my current hacks to map docids 




 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9

 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746356#action_12746356
 ] 

Tim Smith commented on LUCENE-1826:
---

i'll fork off another ticket for the reset(AttributeSource) method


 All Tokenizer implementations should have constructors that take 
 AttributeSource and AttributeFactory
 -

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Assignee: Michael Busch
 Fix For: 2.9

 Attachments: lucene-1826.patch


 I have a TokenStream implementation that joins together multiple sub 
 TokenStreams (i then do additional filtering on top of this, so i can't just 
 have the indexer do the merging)
 in 2.4, this worked fine.
 once one sub stream was exhausted, i just started using the next stream 
 however, in 2.9, this is very difficult, and requires copying Term buffers 
 for every token being aggregated
 however, if all the sub TokenStreams share the same AttributeSource, and my 
 concat TokenStream shares the same AttributeSource, this goes back to being 
 very simple (and very efficient)
 So for example, i would like to see the following constructor added to 
 StandardTokenizer:
 {code}
   public StandardTokenizer(AttributeSource source, Reader input, boolean 
 replaceInvalidAcronym) {
 super(source);
 ...
   }
 {code}
 would likewise want similar constructors added to all Tokenizer sub classes 
 provided by lucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746360#action_12746360
 ] 

Tim Smith commented on LUCENE-1826:
---

forked off the reset(AttributeSource) to LUCENE-1842

 All Tokenizer implementations should have constructors that take 
 AttributeSource and AttributeFactory
 -

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Assignee: Michael Busch
 Fix For: 2.9

 Attachments: lucene-1826.patch


 I have a TokenStream implementation that joins together multiple sub 
 TokenStreams (i then do additional filtering on top of this, so i can't just 
 have the indexer do the merging)
 in 2.4, this worked fine.
 once one sub stream was exhausted, i just started using the next stream 
 however, in 2.9, this is very difficult, and requires copying Term buffers 
 for every token being aggregated
 however, if all the sub TokenStreams share the same AttributeSource, and my 
 concat TokenStream shares the same AttributeSource, this goes back to being 
 very simple (and very efficient)
 So for example, i would like to see the following constructor added to 
 StandardTokenizer:
 {code}
   public StandardTokenizer(AttributeSource source, Reader input, boolean 
 replaceInvalidAcronym) {
 super(source);
 ...
   }
 {code}
 would likewise want similar constructors added to all Tokenizer sub classes 
 provided by lucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource

2009-08-21 Thread Tim Smith (JIRA)
Add reset(AttributeSource) method to AttributeSource


 Key: LUCENE-1842
 URL: https://issues.apache.org/jira/browse/LUCENE-1842
 Project: Lucene - Java
  Issue Type: Wish
  Components: Analysis
Reporter: Tim Smith
 Fix For: 2.9


Originally proposed in LUCENE-1826

Proposing the addition of the following method to AttributeSource
{code}
public void reset(AttributeSource input) {
  if (input == null) {
    throw new IllegalArgumentException("input AttributeSource must not be null");
  }
  this.attributes = input.attributes;
  this.attributeImpls = input.attributeImpls;
  this.factory = input.factory;
}
{code}

Impacts:
* requires all TokenStreams/TokenFilters/etc to call addAttribute() in their 
reset() method, not in their constructor
* requires making AttributeSource.attributes and AttributeSource.attributeImpls 
non-final

Advantages:
Allows creating only a single actual AttributeSource per thread that can then 
be used for indexing with a multitude of TokenStream/Tokenizer combinations 
(allowing utmost reuse of TokenStream/Tokenizer instances)
this results in only a single attributes/attributesImpl map being required 
per thread
addAttribute() calls will almost always return right away (will only be 
initialized once per thread)







-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource

2009-08-21 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746380#action_12746380
 ] 

Tim Smith commented on LUCENE-1842:
---

bq. still pay the price for filling the two hashmaps and the cache lookups. 
this would only ever be incurred once per thread (if the same root 
AttributeSource was always used)
the cache lookups would still need to be done at TokenStream.reset() time, 
however they would pretty much always get a hit

the main use case this proposal supports is as follows:

i have a TokenStream that merges multiple sub token streams (i call this out in 
LUCENE-1826)
in order to do this really efficiently, all sub token streams need to share the 
same AttributeSource
then, the merging TokenStream can just iterate through its sub streams, 
calling incrementToken() to consume all tokens from each stream

without the ability to reset the sub streams' AttributeSource to the same 
AttributeSource used by this merging TokenStream, you have to copy the 
attributes from the sub streams as you iterate 
furthermore, the sub TokenStreams could potentially be any TokenStream (or 
chain of TokenStreams rooted with a Tokenizer)
without the reset(AttributeSource) method, i would have to create the 
TokenStream chain anew for every merging TokenStream (or do the attribute 
copying approach)


 Add reset(AttributeSource) method to AttributeSource
 

 Key: LUCENE-1842
 URL: https://issues.apache.org/jira/browse/LUCENE-1842
 Project: Lucene - Java
  Issue Type: Wish
  Components: Analysis
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 Originally proposed in LUCENE-1826
 Proposing the addition of the following method to AttributeSource
 {code}
 public void reset(AttributeSource input) {
   if (input == null) {
     throw new IllegalArgumentException("input AttributeSource must not be null");
   }
   this.attributes = input.attributes;
   this.attributeImpls = input.attributeImpls;
   this.factory = input.factory;
 }
 {code}
 Impacts:
 * requires all TokenStreams/TokenFilters/etc to call addAttribute() in their 
 reset() method, not in their constructor
 * requires making AttributeSource.attributes and 
 AttributeSource.attributeImpls non-final
 Advantages:
 Allows creating only a single actual AttributeSource per thread that can then 
 be used for indexing with a multitude of TokenStream/Tokenizer combinations 
 (allowing utmost reuse of TokenStream/Tokenizer instances)
 this results in only a single attributes/attributesImpl map being 
 required per thread
 addAttribute() calls will almost always return right away (will only be 
 initialized once per thread)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745423#action_12745423
 ] 

Tim Smith commented on LUCENE-1821:
---

I can work up another patch where the Searcher is passed into Weight.scorer() 
as well if that is an acceptable approach (this method was already changed a lot 
in 2.9 anyway)

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader

2009-08-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745436#action_12745436
 ] 

Tim Smith commented on LUCENE-1821:
---

true, MultiSearcher does kink things up some (and the Searcher abstract class 
in general)

personally, this is not a problem for me (don't use MultiSearcher (not yet at 
least)), and i'm happy with being passed the IndexSearcher instance that 
directly contains the IndexReader i'm being passed

The contract could state that the Searcher provided is the direct container 
of the IndexReader also passed,
at which point both explain() and scorer() would be accurate in terms of this

I would almost like to see something different passed in instead of a 
Searcher/IndexReader pair

i would actually like to see a SearchContext sort of object passed in
this would represent the whole tree of Searchers/IndexReaders
this would allow access to the MultiSearcher, the direct IndexSearcher, and the 
sub IndexReader (which should actually be used for the scoring) (as well as any 
other Searchers in the call stack) 
this SearchContext could also pass in the topScorer/allowDocsInOrder flags 
(but that would be more difficult as scorers have subscorers that need to 
sometimes be created with different flags for these), but this SearchContext 
could be used to pass more information throughout the Scorer API in general 
from the top level (like - always use constant score queries where possible, 
use scoring algorithm X, Y, or Z, and so on)

obviously this would impact the API of Searcher a good deal, as it would have to 
maintain this stack as sub Searchers' search() methods are called
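
A rough sketch of what such a SearchContext might carry (nothing like this 
exists in the API; the fields and names are purely illustrative):
{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Searcher;

class SearchContext {
  final Searcher topSearcher;      // e.g. a MultiSearcher, if one is in play
  final Searcher leafSearcher;     // the IndexSearcher that actually calls Weight.scorer()
  final IndexReader segmentReader; // the segment reader scoring runs over
  final boolean scoreDocsInOrder;
  final boolean topScorer;

  SearchContext(Searcher topSearcher, Searcher leafSearcher, IndexReader segmentReader,
                boolean scoreDocsInOrder, boolean topScorer) {
    this.topSearcher = topSearcher;
    this.leafSearcher = leafSearcher;
    this.segmentReader = segmentReader;
    this.scoreDocsInOrder = scoreDocsInOrder;
    this.topScorer = topScorer;
  }
}
{code}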

 Weight.scorer() not passed doc offset for sub reader
 --

 Key: LUCENE-1821
 URL: https://issues.apache.org/jira/browse/LUCENE-1821
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
 Attachments: LUCENE-1821.patch


 Now that searching is done on a per segment basis, there is no way for a 
 Scorer to know the actual doc id for the documents it matches (only the 
 relative doc offset into the segment)
 If using caches in your scorer that are based on the entire index (all 
 segments), there is now no way to index into them properly from inside a 
 Scorer because the scorer is not passed the needed offset to calculate the 
 real docid
 suggest having the Weight.scorer() method also take an integer for the doc offset
 Abstract Weight class should have a constructor that takes this offset as 
 well as a method to get the offset
 All Weights that have sub weights must pass this offset down to created 
 sub weights
 Details on workaround:
 In order to work around this, you must do the following:
 * Subclass IndexSearcher
 * Add int getIndexReaderBase(IndexReader) method to your subclass
 * during Weight creation, the Weight must hold onto a reference to the passed 
 in Searcher (casted to your sub class)
 * during Scorer creation, the Scorer must be passed the result of 
 YourSearcher.getIndexReaderBase(reader)
 * Scorer can now rebase any collected docids using this offset
 Example implementation of getIndexReaderBase():
 {code}
 // NOTE: a more efficient implementation can be done if you cache the result of
 // gatherSubReaders in your constructor
 public int getIndexReaderBase(IndexReader reader) {
   if (reader == getReader()) {
     return 0;
   } else {
     List readers = new ArrayList();
     gatherSubReaders(readers);
     Iterator iter = readers.iterator();
     int maxDoc = 0;
     while (iter.hasNext()) {
       IndexReader r = (IndexReader) iter.next();
       if (r == reader) {
         return maxDoc;
       }
       maxDoc += r.maxDoc();
     }
   }
   return -1; // reader not in searcher
 }
 {code}
 Notes:
 * This workaround makes it so you cannot serialize your custom Weight 
 implementation

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException

2009-08-20 Thread Tim Smith (JIRA)
AttributeSource.getAttribute() should throw better IllegalArgumentException
---

 Key: LUCENE-1825
 URL: https://issues.apache.org/jira/browse/LUCENE-1825
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor


when setting "use only new API" for TokenStream, i received the following 
exception:

{code}
   [junit] Caused by: java.lang.IllegalArgumentException: This AttributeSource 
does not have the attribute 'interface 
org.apache.lucene.analysis.tokenattributes.TermAttribute'.
[junit] at 
org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:249)
[junit] at 
org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
[junit] at 
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:145)
[junit] at 
org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
[junit] at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
[junit] at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
[junit] at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2613)
{code}

However, i can't actually see the culprit that caused this exception

suggest that the IllegalArgumentException include getClass().getName() in 
order to be able to identify which TokenStream implementation actually caused 
this


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException

2009-08-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745502#action_12745502
 ] 

Tim Smith commented on LUCENE-1825:
---

Looked a little closer at this, and it looks like if the root TokenStream does 
not call addAttribute() for all attributes expected by the indexer, this exception 
occurs

I suppose if the Indexer called addAttribute() instead of getAttribute() this 
wouldn't happen (attributes not provided by TokenStream, but required by 
Indexer would be initialized at index time (and would remain empty))
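
A small illustration of the difference (2.9-era API): getAttribute() throws 
IllegalArgumentException when the attribute was never added, while addAttribute() 
quietly creates and registers an empty instance.
{code}
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

class AddVsGetAttribute {
  static void demo() {
    AttributeSource source = new AttributeSource();
    // source.getAttribute(TermAttribute.class);  // would throw IllegalArgumentException here
    TermAttribute term = (TermAttribute) source.addAttribute(TermAttribute.class); // created on demand
    term.setTermBuffer("example");               // present and usable like any other attribute
  }
}
{code}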

 AttributeSource.getAttribute() should throw better IllegalArgumentException
 ---

 Key: LUCENE-1825
 URL: https://issues.apache.org/jira/browse/LUCENE-1825
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 when setting "use only new API" for TokenStream, i received the following 
 exception:
 {code}
[junit] Caused by: java.lang.IllegalArgumentException: This 
 AttributeSource does not have the attribute 'interface 
 org.apache.lucene.analysis.tokenattributes.TermAttribute'.
 [junit]   at 
 org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:249)
 [junit]   at 
 org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
 [junit]   at 
 org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:145)
 [junit]   at 
 org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
 [junit]   at 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
 [junit]   at 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
 [junit]   at 
 org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2613)
 {code}
 However, i can't actually see the culprit that caused this exception
 suggest that the IllegalArgumentException include getClass().getName() in 
 order to be able to identify which TokenStream implementation actually caused 
 this

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException

2009-08-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745508#action_12745508
 ] 

Tim Smith commented on LUCENE-1825:
---

Updated getAttribute() on AttributeSource as follows to find the source of my 
pain:
{code}
  /**
   * The caller must pass in a Class<? extends Attribute> value.
   * Returns the instance of the passed in Attribute contained in this AttributeSource
   *
   * @throws IllegalArgumentException if this AttributeSource does not contain the
   * Attribute
   */
  public AttributeImpl getAttribute(Class attClass) {
    AttributeImpl att = (AttributeImpl) this.attributes.get(attClass);
    if (att == null) {
      throw new IllegalArgumentException(getClass().getName() + " does not have "
          + "the attribute '" + attClass + "'.");
    }

    return att;
  }
{code}

I see that this could end up being an arbitrary 
org.apache.lucene.util.AttributeSource though if you aren't fully integrating 
the new api


 AttributeSource.getAttribute() should throw better IllegalArgumentException
 ---

 Key: LUCENE-1825
 URL: https://issues.apache.org/jira/browse/LUCENE-1825
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 when setting "use only new API" for TokenStream, i received the following 
 exception:
 {code}
[junit] Caused by: java.lang.IllegalArgumentException: This 
 AttributeSource does not have the attribute 'interface 
 org.apache.lucene.analysis.tokenattributes.TermAttribute'.
 [junit]   at 
 org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:249)
 [junit]   at 
 org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
 [junit]   at 
 org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:145)
 [junit]   at 
 org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
 [junit]   at 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
 [junit]   at 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
 [junit]   at 
 org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2613)
 {code}
 However, I can't actually see the culprit that caused this exception. I 
 suggest that the IllegalArgumentException include getClass().getName() so 
 that the TokenStream implementation that actually caused this can be 
 identified.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource

2009-08-20 Thread Tim Smith (JIRA)
All Tokenizer implementations should have constructor that takes an 
AttributeSource
---

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith


I have a TokenStream implementation that joins together multiple sub 
TokenStreams (I then do additional filtering on top of this, so I can't just 
have the indexer do the merging).

In 2.4, this worked fine: once one sub stream was exhausted, I just started 
using the next stream.

In 2.9, however, this is very difficult, and it requires copying Term buffers 
for every token being aggregated.

However, if all the sub TokenStreams share the same AttributeSource, and my 
concat TokenStream shares that same AttributeSource, this goes back to being 
very simple (and very efficient).


So, for example, I would like to see the following constructor added to 
StandardTokenizer:
{code}
  public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
    super(source);
    ...
  }
{code}

I would likewise want similar constructors added to all Tokenizer subclasses 
provided by Lucene.
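
As a rough sketch only (ConcatTokenStream and its members are illustrative, 
not existing Lucene API), this is the shape of the concatenating stream 
described above, assuming every sub stream was built on the shared 
AttributeSource, e.g. via the proposed constructor:
{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical sketch: chains several sub streams that all write into one
// shared AttributeSource, so no term buffers need to be copied per token.
public final class ConcatTokenStream extends TokenStream {
  private final TokenStream[] subs;
  private int current = 0;

  // 'source' must be the same AttributeSource each sub stream was built on.
  public ConcatTokenStream(AttributeSource source, TokenStream... subs) {
    super(source);
    this.subs = subs;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (current < subs.length) {
      if (subs[current].incrementToken()) {
        return true; // the current sub stream has filled the shared attributes
      }
      current++; // exhausted; move on to the next sub stream
    }
    return false;
  }
}
{code}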


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource

2009-08-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745523#action_12745523
 ] 

Tim Smith commented on LUCENE-1826:
---

I'll do that from now on (feel free to boot them if you feel it's necessary; I 
didn't want to overstep my bounds by suggesting a fix in 2.9).

 All Tokenizer implementations should have constructor that takes an 
 AttributeSource
 ---

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9


 I have a TokenStream implementation that joins together multiple sub 
 TokenStreams (I then do additional filtering on top of this, so I can't just 
 have the indexer do the merging).
 In 2.4, this worked fine: once one sub stream was exhausted, I just started 
 using the next stream.
 In 2.9, however, this is very difficult, and it requires copying Term buffers 
 for every token being aggregated.
 However, if all the sub TokenStreams share the same AttributeSource, and my 
 concat TokenStream shares that same AttributeSource, this goes back to being 
 very simple (and very efficient).
 So, for example, I would like to see the following constructor added to 
 StandardTokenizer:
 {code}
   public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
     super(source);
     ...
   }
 {code}
 I would likewise want similar constructors added to all Tokenizer subclasses 
 provided by Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource

2009-08-20 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-1826:
--

Fix Version/s: 2.9

 All Tokenizer implementations should have constructor that takes an 
 AttributeSource
 ---

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9


 I have a TokenStream implementation that joins together multiple sub 
 TokenStreams (I then do additional filtering on top of this, so I can't just 
 have the indexer do the merging).
 In 2.4, this worked fine: once one sub stream was exhausted, I just started 
 using the next stream.
 In 2.9, however, this is very difficult, and it requires copying Term buffers 
 for every token being aggregated.
 However, if all the sub TokenStreams share the same AttributeSource, and my 
 concat TokenStream shares that same AttributeSource, this goes back to being 
 very simple (and very efficient).
 So, for example, I would like to see the following constructor added to 
 StandardTokenizer:
 {code}
   public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
     super(source);
     ...
   }
 {code}
 I would likewise want similar constructors added to all Tokenizer subclasses 
 provided by Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException

2009-08-20 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-1825:
--

Fix Version/s: 2.9

 AttributeSource.getAttribute() should throw better IllegalArgumentException
 ---

 Key: LUCENE-1825
 URL: https://issues.apache.org/jira/browse/LUCENE-1825
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
 Fix For: 2.9


 When setting TokenStream to use only the new API, I received the following 
 exception:
 {code}
[junit] Caused by: java.lang.IllegalArgumentException: This 
 AttributeSource does not have the attribute 'interface 
 org.apache.lucene.analysis.tokenattributes.TermAttribute'.
 [junit]   at 
 org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:249)
 [junit]   at 
 org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
 [junit]   at 
 org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:145)
 [junit]   at 
 org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
 [junit]   at 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
 [junit]   at 
 org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
 [junit]   at 
 org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2613)
 {code}
 However, I can't actually see the culprit that caused this exception. I 
 suggest that the IllegalArgumentException include getClass().getName() so 
 that the TokenStream implementation that actually caused this can be 
 identified.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource

2009-08-20 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745530#action_12745530
 ] 

Tim Smith commented on LUCENE-1826:
---

NOTE: for me, this is just a nice-to-have.

I currently only use my concat TokenStream on my own TokenStream 
implementations (so I can do this manually on my own TokenStream impls).

However, I would like to be able to directly use Lucene Tokenizers under my 
concat TokenStream in some situations in the future.
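
For illustration, a hedged usage fragment: it assumes both the constructor 
proposed in this issue and the hypothetical ConcatTokenStream sketched in the 
description, neither of which exists in Lucene today.
{code}
// Hypothetical usage: one shared AttributeSource feeds the stock Tokenizers
// and the concatenating stream, so tokens flow through without buffer copies.
AttributeSource shared = new AttributeSource();
TokenStream first = new StandardTokenizer(shared, new StringReader("first part"), true);
TokenStream second = new StandardTokenizer(shared, new StringReader("second part"), true);
TokenStream concat = new ConcatTokenStream(shared, first, second);
{code}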

 All Tokenizer implementations should have constructor that takes an 
 AttributeSource
 ---

 Key: LUCENE-1826
 URL: https://issues.apache.org/jira/browse/LUCENE-1826
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
 Fix For: 2.9


 I have a TokenStream implementation that joins together multiple sub 
 TokenStreams (I then do additional filtering on top of this, so I can't just 
 have the indexer do the merging).
 In 2.4, this worked fine: once one sub stream was exhausted, I just started 
 using the next stream.
 In 2.9, however, this is very difficult, and it requires copying Term buffers 
 for every token being aggregated.
 However, if all the sub TokenStreams share the same AttributeSource, and my 
 concat TokenStream shares that same AttributeSource, this goes back to being 
 very simple (and very efficient).
 So, for example, I would like to see the following constructor added to 
 StandardTokenizer:
 {code}
   public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
     super(source);
     ...
   }
 {code}
 I would likewise want similar constructors added to all Tokenizer subclasses 
 provided by Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


