[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857375#action_12857375 ]

Tim Smith commented on LUCENE-2324:
-----------------------------------

bq. But... could we allow an add/updateDocument call to express this affinity, explicitly?

I would love to be able to explicitly define a segment affinity for the documents I'm feeding. This would then allow me to say:
* all docs from table A have affinity 1
* all docs from table B have affinity 2

This would ideally result in the documents from each table being indexed into different segments. (Obviously, I would then need segment merging to be affinity-aware, so optimize/merging would only merge segments that share an affinity.)

Per thread DocumentsWriters that write their own private segments
------------------------------------------------------------------

Key: LUCENE-2324
URL: https://issues.apache.org/jira/browse/LUCENE-2324
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.1
Attachments: lucene-2324.patch, LUCENE-2324.patch

See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293:

Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
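Returning to the affinity idea in Tim's comment above: a rough, purely hypothetical sketch of the kind of API being requested. No such interface exists in Lucene; every name in it is a made-up assumption.

{code}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;

// Purely hypothetical -- no such API exists in Lucene. The idea: callers tag
// each document with an affinity key, the writer routes it to an in-RAM
// segment dedicated to that key, and an affinity-aware merge policy only
// merges segments whose keys match (keeping table A and table B apart).
public interface AffinityAwareWriter {
  /** Adds the document to the RAM segment associated with the given affinity. */
  void addDocument(Document doc, int affinity) throws IOException;

  /** Updates (delete + add) within the same affinity's segments. */
  void updateDocument(Term id, Document doc, int affinity) throws IOException;
}
{code}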
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857385#action_12857385 ]

Tim Smith commented on LUCENE-2324:
-----------------------------------

bq. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices?

That's what I'm currently thinking I'll have to do. However, it would be ideal if I could either subclass IndexWriter or use IndexWriter directly with this affinity concept (potentially writing my own segment merger that is affinity-aware). That makes it so I can easily use near-real-time indexing, as only one IndexWriter will be in the mix, and it makes managing deletes and a whole host of other issues with multiple indexes disappear.

It also makes it so I can configure memory settings across all affinity groups, instead of having to dynamically create separate writers, each with their own memory bounds.
[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders
[ https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851388#action_12851388 ]

Tim Smith commented on LUCENE-2071:
-----------------------------------

+1

I have a special subclassed IndexSearcher that certain special queries require, so IndexWriter's delete-by-query will fail, as a plain IndexSearcher is passed in. With this added method, I would be able to construct my own Searcher over the readers and then apply the deletes properly.

This would also allow counting the deletes as they occur (which is commonly desired when deleting by query).

It would be nice if this method also worked with non-pooled readers, so my desired method signature would be:

void updateReaders(Readers callback, boolean pooled)

If the readers were already pooled, the flag would have no effect; otherwise the segment readers would be opened just like the non-pooled delete readers are opened.

Allow updating of IndexWriter SegmentReaders
--------------------------------------------

Key: LUCENE-2071
URL: https://issues.apache.org/jira/browse/LUCENE-2071
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.9.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 3.1
Attachments: LUCENE-2071.patch

This discussion kind of started in LUCENE-2047. Basically, we'll allow users to perform delete document, and norms updates, on SegmentReaders that are handled by IndexWriter.
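Returning to the updateReaders proposal in the comment above: a sketch of how the Readers callback could be shaped. The interface body here is an assumption for illustration, not the API from the attached patch.

{code}
import java.io.IOException;
import org.apache.lucene.index.SegmentReader;

// Assumed shape of the proposed callback -- illustrative only, not the
// attached patch's API.
public interface Readers {
  /** Invoked by IndexWriter with its (pooled or freshly opened) SegmentReaders. */
  void process(SegmentReader[] readers) throws IOException;
}
{code}

Usage would then look something like writer.updateReaders(callback, pooled), where the callback wraps each SegmentReader in the custom IndexSearcher subclass, applies the delete-by-query there, and counts the deletions as they happen.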
[jira] Commented: (LUCENE-2071) Allow updating of IndexWriter SegmentReaders
[ https://issues.apache.org/jira/browse/LUCENE-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851528#action_12851528 ]

Tim Smith commented on LUCENE-2071:
-----------------------------------

Found a couple of small issues with the patch attached to this ticket:

1. applyDeletes issue. Saw this in another ticket; I think the flush should be flush(true, true, false), and applyDeletes() should be called in the synchronized block.

2. IndexWriter.changeCount not updated. The call() method does not return a boolean indicating whether there were any changes that would need to be committed. As a result, if no other changes are made to the IndexWriter, the commit will be skipped, even though deletes/norm updates were sent in, and IndexReader.reopen() will then return the old reader without the deletes/norms.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850127#action_12850127 ]

Tim Smith commented on LUCENE-2345:
-----------------------------------

bq. I think we should only commit this only on 3.1 (new feature)?

3.1 only, of course (I just posted a 3.0 patch now because that's what I'm using and I need the functionality now).

bq. Tim, do you think the plugin model (extension by composition) would be workable for your use case? Ie, instead of a factory enabling subclasses of SegmentReader?

As long as the plugin model allows the same capabilities, that could work just fine and could be the final solution for this ticket. I mainly need the ability to add data structures to a SegmentReader that will be shared by all SegmentReaders for a segment, and then add some extra meta information on a per-instance basis.

Is there a ticket or wiki page that details the plugin architecture/design so I could take a look?

However, would the plugins allow overriding specific IndexReader methods? I still see the need to be able to override specific methods on a SegmentReader (in order to track statistics or provide changed/different/faster/more feature-rich implementations). I don't have a direct need for this right now, but I could envision needing it in the future.

Here are a few requirements I would pose for the plugin model (maybe they are already thought of); a sketch of an interface satisfying them follows below:
* Plugins have hooks to reopen themselves (some plugins can be shared across all instances of a SegmentReader)
** These reopen hooks would be called during SegmentReader.reopen()
* Plugins are initialized during SegmentReader.get()/SegmentReader.reopen()
** Plugins should not have to be added after the fact, as this would not allow proper warming/initializing of plugins inside NRT indexing
** I assume this would need to be added as some list of PluginFactories passed to IndexWriter/IndexReader.open()?
* Plugins should have a close method that is called in SegmentReader.close()
** This will allow proper release of any resources
* Plugins are passed the instance of the SegmentReader they are for
** Plugins should be able to access all methods on a SegmentReader
** This would effectively allow overriding a SegmentReader by having a plugin provide the functionality instead (however, only callers explicitly using the plugin would get this benefit)
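A minimal sketch of a plugin interface satisfying those requirements. The names and signatures are assumptions for illustration, not the API from the attached LUCENE-2345_3.0.plugins.patch.

{code}
import java.io.IOException;
import org.apache.lucene.index.SegmentReader;

// Illustrative only -- names and signatures are assumptions, not the
// attached patch's API.
public interface SegmentPlugin {
  /** Called during SegmentReader.get()/reopen(); the reader gives full method access. */
  void init(SegmentReader reader) throws IOException;

  /** Reopen hook, called during SegmentReader.reopen(); may carry shared state forward. */
  SegmentPlugin reopen(SegmentReader newReader) throws IOException;

  /** Called from SegmentReader.close() so resources can be released. */
  void close() throws IOException;
}

// Factories would be registered up front (e.g. on IndexWriter or
// IndexReader.open()) so plugins exist in time for NRT warming.
interface SegmentPluginFactory {
  SegmentPlugin create(SegmentReader reader) throws IOException;
}
{code}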
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-2345:
------------------------------

Attachment: LUCENE-2345_3.0.plugins.patch

Here's a patch (again, against 3.0) showing the minimal API I would like to see from the plugin model.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850323#action_12850323 ]

Tim Smith commented on LUCENE-2345:
-----------------------------------

Found one issue with the plugins patch.

With NRT indexing, if the SegmentReader is opened with no TermInfosReader (for merging), then the plugins will be initialized with a SegmentReader that has no ability to walk the TermsEnum. I guess SegmentPlugin initialization should wait until after the terms index is loaded, or the SegmentPlugin interface should get another method for catching this event.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850361#action_12850361 ]

Tim Smith commented on LUCENE-2345:
-----------------------------------

bq. My patch removes loadTermsIndex method from SegmentReader and requires you to reopen it.

That's definitely much cleaner and would solve the issue in my current patch (sadly, I'm on 3.0 and want to keep my patch there to a minimum until I can port to all the goodness in 3.1).

bq. Also, they extend not only SegmentReader, but the whole hierarchy - SR, MR, DR, whatever.

I just wussed out and did only the SegmentReader case, as that's all I need right now.

bq. as all the hooks are on the factory classes

Could you post your factory class interface? If I base my 3.0 patch off that, I can reduce my 3.1 porting overhead.

Are there any tickets tracking your reopen refactors or your plugin model? If not, feel free to retool this ticket for your plugin model for IndexReaders, as that will solve my use cases (and then some).
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-2345:
------------------------------

Attachment: LUCENE-2345_3.0.patch

Here's a patch against 3.0 that provides the SegmentReaderFactory ability (not tested yet, but I'll be doing that shortly as I integrate this functionality):
* It adds a SegmentReaderFactory; IndexWriter now has a getter and setter for it.
* SegmentReader has a new protected method init(), which is called after the segment reader has been initialized (to allow subclasses to hook this action and do additional initialization, etc.).
* Added 2 new IndexReader.open() calls that allow specifying the SegmentReaderFactory.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849731#action_12849731 ]

Tim Smith commented on LUCENE-2345:
-----------------------------------

That was my plan.
[jira] Created: (LUCENE-2345) Make it possible to subclass SegmentReader
Make it possible to subclass SegmentReader
------------------------------------------

Key: LUCENE-2345
URL: https://issues.apache.org/jira/browse/LUCENE-2345
Project: Lucene - Java
Issue Type: Wish
Components: Index
Reporter: Tim Smith
Fix For: 3.1

I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* to attach custom objects to an instance of a segment reader (caches, statistics, so on and so forth)
* to override methods on a segment reader as needed

Currently this isn't really possible.

I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:

{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return get(readOnly);
  }
}
{code}

It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open(), etc).

I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in the future.
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849358#action_12849358 ]

Tim Smith commented on LUCENE-1821:
-----------------------------------

This would actually be solved by LUCENE-2345 for me, as I would then be able to tag SegmentReaders with any additional accounting information I would need.

Weight.scorer() not passed doc offset for sub reader
----------------------------------------------------

Key: LUCENE-1821
URL: https://issues.apache.org/jira/browse/LUCENE-1821
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Fix For: 3.1
Attachments: LUCENE-1821.patch

Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id of the documents it matches (only the relative doc offset into the segment). If you use caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid.

I suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub-weights must pass this offset down to created sub-weights.

Details on workaround:

In order to work around this, you must do the following:
* Subclass IndexSearcher
* Add an int getIndexReaderBase(IndexReader) method to your subclass
* During Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass)
* During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
* The Scorer can now rebase any collected docids using this offset

Example implementation of getIndexReaderBase() (cleaned up slightly; getIndexReader() and the two-argument protected gatherSubReaders() are the IndexSearcher APIs of this era):

{code}
// NOTE: a more efficient implementation can be done if you cache the result
// of gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getIndexReader()) {
    return 0;
  }
  List<IndexReader> readers = new ArrayList<IndexReader>();
  gatherSubReaders(readers, getIndexReader());
  int maxDoc = 0;
  for (IndexReader r : readers) {
    if (r == reader) {
      return maxDoc; // base of this sub-reader within the top-level reader
    }
    maxDoc += r.maxDoc();
  }
  return -1; // reader not in searcher
}
{code}

Notes:
* This workaround makes it so you cannot serialize your custom Weight implementation.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849455#action_12849455 ]

Tim Smith commented on LUCENE-2345:
-----------------------------------

That's the reassurance I needed :)

Will start working on a patch tomorrow. It will take a few days, as I'll start with a 3.0 patch (which I use), then will create a 3.1 patch once I've got that all fleshed out.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849497#action_12849497 ]

Tim Smith commented on LUCENE-2345:
-----------------------------------

I'll do my initial work on 3.0 so I can absorb the changes now, and will post that patch. At that point, I can wait for you to finish whatever you need, or we can just incorporate the same ability into your patch for the other ticket.

I would just like to see the ability to subclass SegmentReaders in 3.1, so I don't have to port a patch when I absorb 3.1 (just use the finalized APIs).
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844930#action_12844930 ]

Tim Smith commented on LUCENE-2310:
-----------------------------------

Personally, I like keeping Fieldable (or having AbstractField just with abstract methods and no actual implementation). For feeding documents, I use custom Fieldable implementations to reduce the number of setters called, as fields of different types have different constant settings.

Reduce Fieldable, AbstractField and Field complexity
----------------------------------------------------

Key: LUCENE-2310
URL: https://issues.apache.org/jira/browse/LUCENE-2310
Project: Lucene - Java
Issue Type: Sub-task
Components: Index
Reporter: Chris Male
Attachments: LUCENE-2310-Deprecate-AbstractField.patch

In order to move field-type-like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality than storing fields, most of which is being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possibly Fieldable), moving much of the functionality into Field and FieldType.
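Returning to the custom-Fieldable pattern in Tim's comment above: a sketch of a field whose options are preset in its constructor, written against the 3.x Field API. The class name and the chosen options are illustrative assumptions.

{code}
import org.apache.lucene.document.Field;

// Sketch of the pattern described above: a field whose indexing options are
// fixed in the constructor, so feeding code never touches setters.
// The class name and the chosen options are illustrative assumptions.
public class MetadataField extends Field {
  public MetadataField(String name, String value) {
    // stored, indexed as a single token, no term vectors
    super(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO);
  }
}
{code}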
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839682#action_12839682 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

I haven't been able to fully replicate this issue in a unit-test scenario; however, the patch will definitely resolve the 40M of RAM that was allocated and never released for the RAMFiles on the StoredFieldsWriter (keeping that bound to the configured memory size).
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838976#action_12838976 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

I'll work up another patch. Might take me a few minutes to get my head wrapped around the TermVectorsTermsWriter stuff.
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-2283:
------------------------------

Attachment: LUCENE-2283.patch

Here's a new patch with your suggestions.
[jira] Updated: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-2283:
------------------------------

Attachment: LUCENE-2283.patch

Here's a patch for using a pool for stored-fields buffers.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837793#action_12837793 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

I came across this issue looking for a reported memory leak during indexing. A YourKit snapshot showed that the PerDocs for an IndexWriter were using ~40M of memory (at which point I came across this potentially unbounded memory use in StoredFieldsWriter).

This snapshot seems more or less at a stable point (memory grows but then returns to a normal state); however, I have reports that eventually the memory is completely exhausted, resulting in out-of-memory errors. I so far have not found any other major culprit in the Lucene indexing code. This index receives a routine mix of very large and very small documents (which would explain this situation). The VM and system have a more than ample amount of memory given the buffer size and what should be normal indexing RAM requirements.

Also, a major difference between this leak not occurring and it showing up is that previously the IndexWriter was closed when performing commits; now the IndexWriter remains open (just calling IndexWriter.commit()). So, if any memory is leaking during indexing, it is no longer being reclaimed during commit. As a side note, closing the IndexWriter at commit time would sometimes fail, resulting in some following updates failing because the index writer was locked and couldn't be reopened until the old index writer was garbage collected, so I don't want to go back to that for commits.

It's possible there is a leak somewhere else (I currently do not have a snapshot from right before the out-of-memory issues occur, so currently the only thing that stands out is the PerDoc memory use).

As far as a fix goes, wouldn't it be better to have the RAMFiles used for stored fields pull and return byte buffers from the byte block pool on the DocumentsWriter? This would allow the memory to be reclaimed based on the index writer's buffer size (otherwise there is no configurable way to tune this memory use).
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837821#action_12837821 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

ramBufferSizeMB is 64MB.

Here's the YourKit breakdown per class:
* DocumentsWriter - 256 MB
** TermsHash - 38.7 MB
** StoredFieldsWriter - 37.5 MB
** DocumentsWriterThreadState - 36.2 MB
** DocumentsWriterThreadState - 34.6 MB
** DocumentsWriterThreadState - 33.8 MB
** DocumentsWriterThreadState - 27.5 MB
** DocumentsWriterThreadState - 13.4 MB

I'm starting to dig into the ThreadStates now to see if anything stands out here.

bq. Hmm, that makes me nervous, because I think in this case the use should be bounded.

I should be getting a new profile dump at crash time soon, so hopefully that will make things clearer.

bq. That doesn't sound good! Can you post some details on this (eg an exception)?

If I recall correctly, the exception was caused by an out-of-disk-space situation. Obviously, not much can be done about that other than adding more disk space; the situation would recover, but docs would be lost in the interim.

bq. But, anyway, keeping the same IW open and just calling commit is (should be) fine.

Yeah, this should be the way to go, especially as it results in the pooled buffers not needing to be reallocated/reclaimed/etc.; however, right now this is the only change I can currently think of that could result in memory issues.

bq. Yes, that's a great solution - a single pool. But that's a somewhat bigger change.

Seems like this would be the best approach, as it makes the memory bounded by the configuration of the engine, giving better reuse of byte blocks and a better ability to reclaim memory (in DocumentsWriter.balanceRAM()); a rough sketch of the single-pool idea follows below.
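A minimal sketch of that single-pool idea, assuming a simple free-list recycler. This is not DocumentsWriter's actual byte-block pool code; the class name, block size, and structure are illustrative.

{code}
import java.util.ArrayList;
import java.util.List;

// Sketch only: fixed-size byte blocks are checked out for stored-field
// buffers and recycled after flush, so buffered memory stays bounded by
// the writer's RAM budget. Not DocumentsWriter's actual pool code.
public class RecyclingByteBlockPool {
  private static final int BLOCK_SIZE = 32 * 1024; // illustrative block size
  private final List<byte[]> freeBlocks = new ArrayList<byte[]>();

  public synchronized byte[] getBlock() {
    int size = freeBlocks.size();
    return size == 0 ? new byte[BLOCK_SIZE] : freeBlocks.remove(size - 1);
  }

  public synchronized void recycle(byte[] block) {
    freeBlocks.add(block); // returned here once the doc's stored fields are flushed
  }
}
{code}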
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837875#action_12837875 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

bq. I agree. I'll mull over how to do it... unless you're planning on consing up a patch

I'd love to, but don't have the free cycles at the moment :(

bq. How many threads do you pass through IW?

Honestly, I don't 100% know the origin of the threads I'm given. In general, they should come from a static pool, but they may be dynamically allocated if the static pool runs out. One thought I had recently was to control this more tightly by having a limited number of static threads calling the IndexWriter methods, in case that was the issue (but that would be a pretty big change).
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837881#action_12837881 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

The latest profile dump has pointed out a non-Lucene issue as causing some memory growth, so feel free to drop down the priority. However, it still seems like using the byte block pool for the stored fields would be good overall.
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837919#action_12837919 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

Another note is that this was on a 64-bit VM. I've noticed that all the memory-size calculations assume 4-byte pointers, so perhaps that can lead to more memory being used than would otherwise be expected (although 256 MB is still well over the 2x memory use that would potentially be expected in that case).
[jira] Commented: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838017#action_12838017 ]

Tim Smith commented on LUCENE-2283:
-----------------------------------

I'm working up a patch for the shared byte-block pool for stored-field buffers (found a few cycles).
[jira] Created: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
Possible Memory Leak in StoredFieldsWriter
------------------------------------------

Key: LUCENE-2283
URL: https://issues.apache.org/jira/browse/LUCENE-2283
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith

StoredFieldsWriter creates a pool of PerDoc instances. This pool will grow but never be reclaimed by any mechanism. Furthermore, each PerDoc instance contains a RAMFile, and this RAMFile will also never be truncated and will only ever grow (as far as I can tell).

When feeding documents with a large number of stored fields (or one large dominating stored field), this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare.

Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of RAMFiles that are cached), etc.
[jira] Created: (LUCENE-2276) Add IndexReader.document(int, Document, FieldSelector)
Add IndexReader.document(int, Document, FieldSelector)
------------------------------------------------------

Key: LUCENE-2276
URL: https://issues.apache.org/jira/browse/LUCENE-2276
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Tim Smith

The Document object passed in would be populated with the fields identified by the FieldSelector for the specified internal document id. This method would allow reuse of Document objects when retrieving stored fields from the index.
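A sketch of how the wished-for overload might be used. The overload itself does not exist in Lucene (that is the point of the ticket), and the reset step and process() consumer are assumptions for illustration.

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;

// Hypothetical usage of the proposed IndexReader.document(int, Document,
// FieldSelector) -- this overload does not exist in Lucene.
Document reusableDoc = new Document();
FieldSelector selector = new MapFieldSelector(new String[] { "title", "date" });
for (int docId = 0; docId < reader.maxDoc(); docId++) {
  if (reader.isDeleted(docId)) {
    continue;
  }
  reusableDoc.getFields().clear();               // assumed reset-for-reuse step
  reader.document(docId, reusableDoc, selector); // proposed: populate in place
  process(reusableDoc);                          // hypothetical consumer
}
{code}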
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790803#action_12790803 ]

Tim Smith commented on LUCENE-1923:
-----------------------------------

Added getName() in case anyone is currently relying on the current (default) output of toString() on index readers. Feel free to rename the getName() methods to toString().

Add toString() or getName() method to IndexReader
-------------------------------------------------

Key: LUCENE-1923
URL: https://issues.apache.org/jira/browse/LUCENE-1923
Project: Lucene - Java
Issue Type: Wish
Components: Index
Reporter: Tim Smith
Assignee: Michael McCandless
Attachments: LUCENE-1923.patch

It would be very useful for debugging if IndexReader either had a getName() method or a toString() implementation that would produce a string identification of the reader:
* for SegmentReader, this would return the same as getSegmentName()
* for directory readers, this would return the generation id?
* for MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...)

Right now, I have to check instanceof for SegmentReader, then call getSegmentName(); for all other IndexReader types, I would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point I would have to recursively walk sub-readers and try again).

I could work up a patch if others like this idea.
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787472#action_12787472 ]

Tim Smith commented on LUCENE-1923:
-----------------------------------

I won't have the time till after the new year. If someone else wants to work up a patch, go for it (this seems simple enough and adds some nice info capabilities for logging/etc); otherwise, I'll get to it when I can.
[jira] Updated: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1923: -- Attachment: LUCENE-1923.patch Here's a simple patch to get the ball rolling. This adds a getName() method to IndexReader. the default implementation will be: SimpleClassName(subreader.getName(), subreader.getName(), ...) SegmentReader will return the same value as getSegmentName() DirectoryReader will return: DirectoryReader(segment_N, segment.getName(), segment.getName(), ...) ParallelReader will return: ParallelReader(parallelReader1.getName(), parallelReader2.getName(), ...) this currently does not have a toString() implementation that returns getName(). do with this patch as you will Add toString() or getName() method to IndexReader - Key: LUCENE-1923 URL: https://issues.apache.org/jira/browse/LUCENE-1923 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Attachments: LUCENE-1923.patch It would be very useful for debugging if IndexReader either had a getName() method, or a toString() implementation that would get a string identification for the reader. for SegmentReader, this would return the same as getSegmentName() for Directory readers, this would return the generation id? for MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...) right now, i have to check instanceof for SegmentReader, then call getSegmentName(), and for all other IndexReader types, i would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point i would have to recursively walk sub readers and try again) I could work up a patch if others like this idea -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
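Roughly, the default implementation described in the comment could look like the following (a sketch, not the attached patch; it leans on getSequentialSubReaders(), which leaf readers return null from, so e.g. SegmentReader would override getName() to return the segment name):
{code}
// default getName() on IndexReader: SimpleClassName(sub1.getName(), sub2.getName(), ...)
public String getName() {
  StringBuilder sb = new StringBuilder(getClass().getSimpleName()).append('(');
  IndexReader[] subs = getSequentialSubReaders();
  if (subs != null) {
    for (int i = 0; i < subs.length; i++) {
      if (i > 0) sb.append(", ");
      sb.append(subs[i].getName());
    }
  }
  return sb.append(')').toString();
}
{code}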
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786921#action_12786921 ] Tim Smith commented on LUCENE-1859: --- close if you like. application writers can add guards for this if they like/need to as a custom TokenFilter. mainly created this ticket as this can result in an unbounded buffer should people use the token stream api incorrectly (or against the suggestions of lucene core developers) TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
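The guard filter mentioned above is easy to sketch against the 2.9 attribute API; it is similar in spirit to the stock LengthFilter, which already drops tokens outside a min/max length (class name and limit below are made up):
{code}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// drops tokens longer than a configured limit so the shared term buffer
// never grows unbounded from one pathological token
public final class MaxTokenLengthFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final int maxLength;

  public MaxTokenLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.termLength() <= maxLength) {
        return true; // small enough, pass it through
      }
      // otherwise skip the oversized token entirely
    }
    return false;
  }
}
{code}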
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781615#action_12781615 ] Tim Smith commented on LUCENE-2086: --- Got some performance numbers: Description of test (NOTE: this is representative of actions that may occur in a running system (not a contrived test)):
* feed 4 million operations (3/4 are deletes, 1/4 are updates (single field))
* commit
* feed 1 million operations (about 1/3 are updates, 2/3 deletes (randomly selected))
* commit
Numbers:
|| Desc || Old || New ||
| feed 4 million | 56914ms | 15698ms |
| commit 4 million | 9072ms | 14291ms |
| total (4 million) | 65986ms | 29989ms |
| update 1 million | 46096ms | 11340ms |
| commit 1 million | 13501ms | 9273ms |
| total (1 million) | 59597ms | 20613ms |
This shows significant improvements with the patch applied (1/3 the time for the 1 million feed, about 1/2 the time for the initial 4 million feed) This means i'm definitely gonna need to incorporate this patch while i'm still on 3.0 (will upgrade to 3.0 as soon as its out, then apply this fix) Ideally, a 3.0.1 would be forthcoming in the next month or so with this fix so i wouldn't have to maintain this patched overlay of code When resolving deletes, IW should resolve in term sort order Key: LUCENE-2086 URL: https://issues.apache.org/jira/browse/LUCENE-2086 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2086.patch See java-dev thread IndexWriter.updateDocument performance improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780698#action_12780698 ] Tim Smith commented on LUCENE-2086: --- any chance this can go into 3.0.0 or a 3.0.1? When resolving deletes, IW should resolve in term sort order Key: LUCENE-2086 URL: https://issues.apache.org/jira/browse/LUCENE-2086 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2086.patch See java-dev thread IndexWriter.updateDocument performance improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780701#action_12780701 ] Tim Smith commented on LUCENE-2086: --- i've seen the deletes dominating commit time quite often, so obviously it would be very useful to be able to absorb this optimization sooner rather than later (what's the timeframe for 3.1?). otherwise i'll have to override the classes involved and pull in this patch (never liked this approach myself) When resolving deletes, IW should resolve in term sort order Key: LUCENE-2086 URL: https://issues.apache.org/jira/browse/LUCENE-2086 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2086.patch See java-dev thread IndexWriter.updateDocument performance improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780710#action_12780710 ] Tim Smith commented on LUCENE-2086: --- bq. maybe try it and report back? i'll see if i can find some cycles to try this against the most painful use case i have bq. I'd rather see us release a 3.1 sooner rather than later, instead. yes please. I would definitely like to see a more accelerated release cycle (even if less functionality gets into each minor release) When resolving deletes, IW should resolve in term sort order Key: LUCENE-2086 URL: https://issues.apache.org/jira/browse/LUCENE-2086 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2086.patch See java-dev thread IndexWriter.updateDocument performance improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777008#action_12777008 ] Tim Smith commented on LUCENE-1909: --- I have the following use case: i have a configuration bean that can be customized via xml at config time. in this bean, i expose the setting for the terms index divisor, so my bean has to have a default value for this. right now, i just use 1 for the default value. would be nice if i could just use the lucene constant instead of using 1, as the lucene constant could change in the future (not really likely, but it's one less constant i have to maintain) if the default is not made public i have 2 options:
# use a hard coded constant in my code for the default value (doing this right now)
# use an Integer object, and have null be the default
the nasty part about the second option is that i now have to do conditional opening of the reader depending on if null is the value (unset), when it would be much simpler (and easier for me to maintain) if i just always pass in that value Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public --- Key: LUCENE-1909 URL: https://issues.apache.org/jira/browse/LUCENE-1909 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE_1909.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
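The two options read roughly like this in code (bean and method names invented; the open() overloads are the Lucene 3.0 ones):
{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class ReaderConfig {
  // option 1: hard-coded default, duplicating Lucene's (currently private) constant
  private static final int DEFAULT_TERMS_INDEX_DIVISOR = 1;

  // option 2: null = unset, inherit whatever Lucene's default happens to be
  private Integer termsIndexDivisor; // set from xml config if present

  public IndexReader open(Directory dir) throws IOException {
    // option 2 forces this conditional open; a public constant would let the
    // configured value always be passed unconditionally
    return termsIndexDivisor == null
        ? IndexReader.open(dir, true)
        : IndexReader.open(dir, null, true, termsIndexDivisor.intValue());
  }
}
{code}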
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777064#action_12777064 ] Tim Smith commented on LUCENE-1909: --- users can see the live setting via things like JMX/admin ui. also, if i intend users to actually change the value regularly, i can provide user facing documentation that goes into detail without the user needing to dig further into lucene internals (a memory tuning guide or something). currently just exposing this setting myself as a SUPER ADVANCED setting (just in case it will need to be tuned for custom use cases in the future) (can't tune it if it's not exposed in config) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public --- Key: LUCENE-1909 URL: https://issues.apache.org/jira/browse/LUCENE-1909 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE_1909.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777109#action_12777109 ] Tim Smith commented on LUCENE-1909: --- what you describe requires effectively 2 settings:
* custom term infos divisor enabled/disabled
* configured value if enabled
this then results in more complexity in opening the index reader (conditional opening where a non-conditional open with the configured divisor would do the trick) any admin ui would also require more conditional handling of displaying this setting (as you described) (i'm not displaying it other than in JMX now anyway, so it doesn't really matter for me, and JMX just has a readonly attribute that shows the configured value (1 by default)) personally, i don't care too much if this constant is made public or not (would make it so i use that constant instead of defining my own with the same value), so it only saves me 1 line (and it's not like the default will ever change from 1 in the lucene code anyway) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public --- Key: LUCENE-1909 URL: https://issues.apache.org/jira/browse/LUCENE-1909 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE_1909.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777118#action_12777118 ] Tim Smith commented on LUCENE-1909: --- Only thing i would want the constant for is to know what the default divisor is. The default just happens to be 1 (no divisor/off). However (while unlikely) a new version of lucene could default to using a real divisor (maybe once everyone is on solid state disks, a higher divisor will result in the same speed of access, with less memory use), at which point, if i upgrade to a new version of lucene, i want to inherit that changed setting (as the default was selected by people that probably know better than me what will better serve the general use of lucene in terms of memory and performance) right now, if i want to inherit the default i would have to do a conditional IndexReader.open() and store my setting as a pair (enabled/disabled, divisor), which could be encoded in an Integer object (null = disabled/use lucene default) if the constant is made public, it's easier for me to inherit that default setting. of course at the end of the day, either approach will only be about 5 lines of code difference, so again, i don't really care too much about the outcome of this bq. By the way, if you use a final constant, without recompiling it would never change,... I never drop a new lucene in without recompiling (so that doesn't cause any difference for me) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public --- Key: LUCENE-1909 URL: https://issues.apache.org/jira/browse/LUCENE-1909 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE_1909.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1909) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public
[ https://issues.apache.org/jira/browse/LUCENE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777136#action_12777136 ] Tim Smith commented on LUCENE-1909: --- bq. If you want to inherit the setting, use the correct constructor agreed, just a tiny bit more complexity on my side for that (but it's so insignificant that it doesn't really matter, and is really not even worth arguing over) if the constant was public, i'd use it, if not, no worries (for me at least) bq. By the default the feature is off. You can't inherit anything about it. ideally, i want to inherit that the feature is off by default, then allow config to turn it on (by providing a value greater than one for this setting, or just 1 to allow config to explicitly disable) using the constructor with no divisor does this (i just need to call the constructor conditionally depending on if the setting was explicitly configured). that's easy and is no problem to do at all (just a couple of extra lines of code in my app layer) Make IndexReader.DEFAULT_TERMS_INDEX_DIVISOR public --- Key: LUCENE-1909 URL: https://issues.apache.org/jira/browse/LUCENE-1909 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Uwe Schindler Priority: Trivial Fix For: 3.0 Attachments: LUCENE_1909.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1923) Add toString() or getName() method to IndexReader
Add toString() or getName() method to IndexReader - Key: LUCENE-1923 URL: https://issues.apache.org/jira/browse/LUCENE-1923 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith It would be very useful for debugging if IndexReader either had a getName() method, or a toString() implementation that would get a string identification for the reader. for SegmentReader, this would return the same as getSegmentName() for Directory readers, this would return the generation id? for MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...) right now, i have to check instanceof for SegmentReader, then call getSegmentName(), and for all other IndexReader types, i would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point i would have to recursively walk sub readers and try again) I could work up a patch if others like this idea -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758717#action_12758717 ] Tim Smith commented on LUCENE-1923: --- I'll work up a patch that will do the following: add getName() method to IndexReader (and all subclasses (SegmentReader, DirectoryReader, MultiReader, any others i'm not currently aware of that i track down)). have toString() return indexreaderclassname(getName()) so, toString for a SegmentReader will look something like: org.apache.lucene.index.SegmentReader(_ae) for a DirectoryReader, it'll look like: org.apache.lucene.index.DirectoryReader(segments_7) Add toString() or getName() method to IndexReader - Key: LUCENE-1923 URL: https://issues.apache.org/jira/browse/LUCENE-1923 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith It would be very useful for debugging if IndexReader either had a getName() method, or a toString() implementation that would get a string identification for the reader. for SegmentReader, this would return the same as getSegmentName() for Directory readers, this would return the generation id? for MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...) right now, i have to check instanceof for SegmentReader, then call getSegmentName(), and for all other IndexReader types, i would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point i would have to recursively walk sub readers and try again) I could work up a patch if others like this idea -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757199#action_12757199 ] Tim Smith commented on LUCENE-1821: --- I've been playing with per-segment caches for the last couple of weeks and have got everything working pretty well. However, i end up having to do a lot of mapping between an IndexReader instance and the index into the IndexReader[] array of the IndexSearcher. this then allows me to easily get the proper document offset where needed, and/or get a handle on the proper per-segment cache/evaluation object/etc For my use cases, it would be much easier if the following methods were available: on Weight:
{code}
// readerId is the i in the for (int i = 0; i < readers.length; ++i) loop in IndexSearcher
// NOTE: readerId is at the IndexSearcher level, not the MultiSearcher level
public Scorer scorer(IndexReader reader, int readerId, boolean inOrder, boolean topLevel);
{code}
on Collector:
{code}
public void setNextReader(IndexReader reader, int docBase, int readerId);
// NOTE: this isn't strictly needed, as it's easier to get the readerId from docBase
// (using a cached int[] of docbases for the searcher)
{code}
I suppose i could use the fact that these methods will always be called in order, keeping and incrementing a counter, however the javadoc explicitly says that these methods may be called out of segment order to be more efficient in the future. It would therefore be very useful if these indexes were passed into these methods. To work around this, my searcher currently has a getReaderIdForReader() method very similar to my earlier proposed getIndexReaderBase() method Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-1821.patch Now that searching is done on a per segment basis, there is no way for a Scorer to know the actual doc id for the document's it matches (only the relative doc offset into the segment) If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the real docid suggest having Weight.scorer() method also take a integer for the doc offset Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset All Weights that have sub weights must pass this offset down to created sub weights Details on workaround: In order to work around this, you must do the following: * Subclass IndexSearcher * Add int getIndexReaderBase(IndexReader) method to your subclass * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class) * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader) * Scorer can now rebase any collected docids using this offset Example implementation of getIndexReaderBase(): {code} // NOTE: more efficient implementation can be done if you cache the result of gatherSubReaders in your constructor public int getIndexReaderBase(IndexReader reader) { if (reader == getReader()) { return 0; } else { List readers = new ArrayList(); gatherSubReaders(readers); Iterator iter = readers.iterator(); int maxDoc = 0; while (iter.hasNext()) { IndexReader r = (IndexReader)iter.next(); if (r == reader) { return maxDoc; } maxDoc += r.maxDoc(); } } return -1; // reader not in searcher } {code} Notes: * This workaround makes it so you cannot serialize your custom Weight implementation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
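For reference, the "cached int[] of docbases" lookup mentioned in the comment can be a one-liner over a sorted array (hypothetical helper; assumes docStarts[i] holds the doc base of sub-reader i, in searcher order):
{code}
import java.util.Arrays;

public final class ReaderIdLookup {
  private final int[] docStarts; // docStarts[i] = doc base of sub-reader i

  public ReaderIdLookup(int[] docStarts) {
    this.docStarts = docStarts;
  }

  // maps the docBase handed to Collector.setNextReader() back to the
  // sub-reader's index in the searcher's IndexReader[]
  public int readerIdForDocBase(int docBase) {
    int idx = Arrays.binarySearch(docStarts, docBase);
    // setNextReader() passes an exact doc base, so idx >= 0; the
    // insertion-point branch also handles arbitrary docids
    return idx >= 0 ? idx : -idx - 2;
  }
}
{code}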
[jira] Created: (LUCENE-1915) Add static openInput(File,...) methods to all FSDirectory implementations
Add static openInput(File,...) methods to all FSDirectory implementations - Key: LUCENE-1915 URL: https://issues.apache.org/jira/browse/LUCENE-1915 Project: Lucene - Java Issue Type: Wish Components: Store Reporter: Tim Smith It would be really useful if NIOFSDirectory and MMapDirectory had static methods for opening an input for arbitrary Files SimpleFSDirectory should likewise have a static openInput(File) method in order to cover all bases (right now, SimpleFSIndexInput only has protected access) This allows creating a custom FSDirectory implementation that can use any criteria desired to determine what Input implementation to use for opening a file. I know the FileSwitchDirectory provides some ability to do this, however that locks the selection criteria down to only the file extension in use. also, the FileSwitchDirectory approach seems to want to have each directory at different paths (as list() methods just cat the directory listings of the sub directories, which could cause havoc if both sub directories point to the same FS path?) opening up these static openInput() methods would allow creating a custom FS store implementation that could for instance mmap files of a particular type and size and use NIO for other files, and maybe even use the SimpleFS input for a third category of files. Could also then apply different buffer sizes to different files, perform RAM caching of particular inputs, etc -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
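Until such static methods exist, one workaround sketch is two FSDirectory instances over the same path with all writes and listings routed through a single side, which sidesteps the doubled list() problem called out above (a sketch against the Lucene 3.0 store API; the routing criteria here is an arbitrary extension check):
{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class SwitchingDirectory extends Directory {
  private final Directory mmap; // for files worth memory-mapping
  private final Directory nio;  // for everything else

  public SwitchingDirectory(File path) throws IOException {
    mmap = new MMapDirectory(path);
    nio = new NIOFSDirectory(path);
  }

  private Directory dirFor(String name) {
    // any criteria (size, file type, config) could go here
    return name.endsWith(".frq") ? mmap : nio;
  }

  @Override public IndexInput openInput(String name) throws IOException {
    return dirFor(name).openInput(name);
  }

  // writes, listings and locking all go through one side so nothing is doubled
  @Override public String[] listAll() throws IOException { return nio.listAll(); }
  @Override public boolean fileExists(String name) throws IOException { return nio.fileExists(name); }
  @Override public long fileModified(String name) throws IOException { return nio.fileModified(name); }
  @Override public void touchFile(String name) throws IOException { nio.touchFile(name); }
  @Override public void deleteFile(String name) throws IOException { nio.deleteFile(name); }
  @Override public long fileLength(String name) throws IOException { return nio.fileLength(name); }
  @Override public IndexOutput createOutput(String name) throws IOException { return nio.createOutput(name); }
  @Override public void sync(String name) throws IOException { nio.sync(name); }
  @Override public Lock makeLock(String name) { return nio.makeLock(name); }
  @Override public void close() throws IOException { nio.close(); mmap.close(); }
}
{code}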
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748071#action_12748071 ] Tim Smith commented on LUCENE-1859: --- bq. The worst-case scenario seems kind of theoretical 100% agree, but even if one extremely large token gets added to the stream (and possibly dropped prior to indexing), the char[] grows without ever shrinking back (so it can result in memory usage growing if bad content is thrown in (and people have no shortage of bad content)) bq. Is a priority of major justified? major is just the default priority (feel free to change) bq. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session? using reusable TokenStream will never cause the buffer to be nulled (as far as i can tell) for the lifetime of the thread (please correct me if i'm wrong on this) i would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing this to be statically updated), just as a means to bound the max memory used here currently, the memory use is bounded by Integer.MAX_VALUE (which is really big) If someone feeds a large text document with no spaces or other delimiting characters, a non-intelligent tokenizer would treat this as 1 big token (and grow the char[] accordingly) TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748071#action_12748071 ] Tim Smith edited comment on LUCENE-1859 at 8/26/09 11:31 AM: - bq. The worst-case scenario seems kind of theoretical 100% agree, but even if one extremely large token gets added to the stream (and possibly dropped prior to indexing), the char[] grows without ever shrinking back (so it can result in memory usage growing if bad content is thrown in (and people have no shortage of bad content)) bq. Is a priority of major justified? major is just the default priority (feel free to change) bq. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session? using reusable TokenStream will never cause the buffer to be nulled (as far as i can tell) for the lifetime of the thread (please correct me if i'm wrong on this) i would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing this to be statically updated), just as a means to bound the max memory used here currently, the memory use is bounded by Integer.MAX_VALUE (which is really big) If someone feeds a large text document with no spaces or other delimiting characters, a non-intelligent tokenizer would treat this as 1 big token (and grow the char[] accordingly) was (Author: tsmith): b1. The worst-case scenario seems kind of theoretical 100% agree, but even if one extremely large token gets added to the stream (and possibly dropped prior to indexing), the char[] grows without ever shrinking back (so it can result in memory usage growing if bad content is thrown in (and people have no shortage of bad content)) bq. Is a priority of major justified? major is just the default priority (feel free to change) bq. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session? using reusable TokenStream will never cause the buffer to be nulled (as far as i can tell) for the lifetime of the thread (please correct me if i'm wrong on this) i would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing this to be statically updated), just as a means to bound the max memory used here currently, the memory use is bounded by Integer.MAX_VALUE (which is really big) If someone feeds a large text document with no spaces or other delimiting characters, a non-intelligent tokenizer would treat this as 1 big token (and grow the char[] accordingly) TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748082#action_12748082 ] Tim Smith commented on LUCENE-1859: --- bq. which non-intelligent tokenizers are you referring to? nearly all the lucene tokenizers have 255 as a limit. perhaps this is a non-issue with regards to lucene tokenizers. however, Tokenizers can be implemented by anyone (not sure if there are adequate warnings about keeping tokens short) it also may not be possible to keep tokens short, i may need to index a rather long id string in a TokenStream fashion, which will grow the buffer without ever reclaiming it. perhaps it should be the responsibility of the Tokenizer to shrink the TermBuffer if it adds long tokens (but this will probably require some helper methods) TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748077#action_12748077 ] Tim Smith commented on LUCENE-1859: --- bq. I would set this to minor and would not take care before 2.9. i would agree with this. just reported the issue as it has the potential to cause memory issues (and would think something should be done about it (in the long term at least)) also, the AttributeSource stuff does result in TermAttributeImpl being held onto pretty much forever if using a reusableTokenStream (correct?) wasn't a new Token() created by the indexer for each doc/field in 2.4? so the unbounded growth would only last at most for the duration of indexing that one document. with Attribute caching in the TokenStream, the buffer now lasts the duration of the TokenStream (or its underlying AttributeSource), which could remain until shutdown TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748091#action_12748091 ] Tim Smith commented on LUCENE-1859: --- i fail to see the complexity of adding one method to TermAttribute:
{code}
public void shrinkBuffer(int maxSize) {
  // shrink only if the current term still fits and the buffer is oversized
  if ((maxSize > termLength) && (termBuffer.length > maxSize)) {
    termBuffer = new char[maxSize];
  }
}
{code}
Not having this is fine as long as it's well documented that emitting large tokens can and will result in memory growing uncontrolled (especially if using many indexing threads) TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748103#action_12748103 ] Tim Smith commented on LUCENE-1859: --- bq. Death by a thousand cuts. This is one cut. by this logic, nothing new can ever be added. The thing that brought this to my attention was the new TokenStream API (one cut (rather big, but i like the new API so i'm happy with the blood loss (makes me dizzy and happy))) The new TokenStream API holds onto these char[] much longer (if not forever), so this results in memory growing unbounded unless there is some facility to truncate/null out the char[] bq. I wouldn't even add the note to the documentation. I don't believe there is ever any valid argument against adding documentation. If someone can shoot themselves in the foot with the gun you gave them, at least tell them not to point the gun at their foot with the safety off. bq. The only reason to do this is to keep average memory usage down for the hell of it. keeping average memory usage down prevents those wonderful OutOfMemory Exceptions (which are difficult at best to recover from) TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748122#action_12748122 ] Tim Smith commented on LUCENE-1859: --- On documentation: any warnings/precautions should always be called out (calling out the external link (wiki/etc) for in depth details). in depth descriptions of the details can be pushed off to wiki pages or external references, as long as a link is provided for the curious, but i would still argue that they should exist bq. this doesn't prevent the OOM, it just makes it less likely all you can ever do for OOM issues is make them less likely (short of just fixing a bug that holds onto memory like mad). If accepting arbitrary content, there will always be a possibility of the content forcing OOM issues. In general, everything possible should be done to reduce the likelihood of such OOM issues where possible (IMO). TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was also an issue with Token previously as well If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory Obviously, it can be argued that Tokenizer's should never emit large tokens, however it seems that the TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set I don't think i have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario) perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747441#action_12747441 ] Tim Smith commented on LUCENE-1849: --- bq. If we were to provide a default in Collector, it should be a simple constant, not a variable. in that case, it may be useful to have this method return false by default (expecting docs in order, as this is the default in 2.4) Add OutOfOrderCollector and InOrderCollector subclasses of Collector Key: LUCENE-1849 URL: https://issues.apache.org/jira/browse/LUCENE-1849 Project: Lucene - Java Issue Type: Wish Components: Search Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor Fix For: 2.9 I find myself always having to implement these methods, and i always return a constant (depending on if the collector can handle out of order hits) would be nice for these two convenience abstract classes to exist that implemented acceptsDocsOutOfOrder() as final and returned the appropriate value -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747587#action_12747587 ] Tim Smith commented on LUCENE-1849: --- bq. I would prefer not to make a default here, ie, force an explicit choice, because it is an expert API. very reasonable bq. BooleanQuery gets sizable gains in performance if you let it return docs out of order. Any stats on the performance gains here available? didn't see any on a cursory glance through javadoc Also, are the implications of out of order docids coming back from nextDoc() well documented (javadoc?, wiki?)? I guess out of order docids really screw up advance(int), so you should never call advance(int) if you allowed out of order collection for a Scorer? Add OutOfOrderCollector and InOrderCollector subclasses of Collector Key: LUCENE-1849 URL: https://issues.apache.org/jira/browse/LUCENE-1849 Project: Lucene - Java Issue Type: Wish Components: Search Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor Fix For: 2.9 I find myself always having to implement these methods, and i always return a constant (depending on if the collector can handle out of order hits) would be nice for these two convenience abstract classes to exist that implemented acceptsDocsOutOfOrder() as final and returned the appropriate value -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747645#action_12747645 ] Tim Smith commented on LUCENE-1849: --- bq. Out-of-order scoring is only used for top-scorers today in Lucene I see that FilteredQuery passes scoreDocsInOrder down to its sub query Is this incorrect? seems like this could cause problems as FilteredQuery does call nextDoc/advance on its sub query (which could be out of order because of this) Add OutOfOrderCollector and InOrderCollector subclasses of Collector Key: LUCENE-1849 URL: https://issues.apache.org/jira/browse/LUCENE-1849 Project: Lucene - Java Issue Type: Wish Components: Search Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor Fix For: 2.9 I find myself always having to implement these methods, and i always return a constant (depending on if the collector can handle out of order hits) would be nice for these two convenience abstract classes to exist that implemented acceptsDocsOutOfOrder() as final and returned the appropriate value -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746809#action_12746809 ] Tim Smith commented on LUCENE-1821: --- bq. Actually sorting (during collection) already gives you the docBase so shouldn't your app already have the context needed for this? Yes, i get the docbase and all during collection, so doing sorting with a top level cache will be no problem. I was mainly using sorting as an example of some of the pain caused by per-segment searching/caches (the Collector API makes it easy enough to do sorting on the top level or per segment, so i'm not concerned about integration here) For my app, i plan to allow sorting to be either per-segment or top-level in order to allow people to choose their poison: faster commit/less memory vs faster sorting. I also plan to do faceting likewise. certain features will always require a top-level cache (but those are advanced features anyway and should be expected to have impacts on commit time/first search time) bq. Hmm... is advance in fact costly for your DocIdSets? Think how costly it would be to do advance for the SortedVInt DocIdSet (linear search over compressed values). for a bitset, this is instantaneous, but to conserve memory, it's better to use a sorted int[] (or the SortedVInt stuff 2.9 provides) in the end, i plan to bucketize the collected docs per segment, so this should hopefully be less of an issue. nice thing about that approach is that i can have a bitset for one segment (lots of matches in this segment) and a very small int[] for a different segment, based on the matches per segment. Biggest difficulty is doing the mapping to the per-segment DocIdSet (which will probably have to be slower) bq. this one method would allow you to not have to subclass IndexSearcher. I already have to subclass index searcher (i do a lot of extra stuff), however the IndexSearcher doesn't provide any protected access to its sub readers and doc starts, so i have to do this myself in my subclass's constructor (in the same way IndexSearcher is doing this). I would really like to see getIndexReaderBase() added to 2.9's IndexSearcher. I would also like to see the subreaders and docstarts either made protected or given protected accessor methods (so i don't have to recreate the same set of sub readers (and make sure i do this the same way for future versions of lucene)). Would also be nice to see a protected constructor on IndexSearcher like so:
{code}
protected IndexSearcher(IndexReader reader, IndexReader[] subReaders, int[] docStarts) {
  ...
}
{code}
This would allow creating temporary IndexSearchers much faster (no need to gather sub readers). This would allow:
* easily creating an IndexSearcher that is top-level (subReaders[] would be length 1 and just contain reader)
* creating a temporary IndexSearcher off another IndexSearcher that contains some short lived context (i have this use case)
Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-1821.patch Now that searching is done on a per segment basis, there is no way for a Scorer to know the actual doc id for the document's it matches (only the relative doc offset into the segment) If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the real docid suggest having Weight.scorer() method also take a integer for the doc offset Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset All Weights that have sub weights must pass this offset down to created sub weights Details on workaround: In order to work around this, you must do the following: * Subclass IndexSearcher * Add int getIndexReaderBase(IndexReader) method to your subclass * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class) * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader) * Scorer can now rebase any collected docids using this offset Example implementation of getIndexReaderBase(): {code} // NOTE: more efficient implementation can be done if you cache the result of gatherSubReaders in your constructor public int getIndexReaderBase(IndexReader reader) { if (reader == getReader()) { return 0; } else { List readers = new ArrayList();
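For illustration only, here is what the two listed uses could look like if the proposed constructor existed; none of this compiles against current Lucene, since the constructor (and the hypothetical MyIndexSearcher subclass exposing it) are exactly what is being requested:
{code}
// hypothetical subclass surfacing the proposed protected constructor
public class MyIndexSearcher extends IndexSearcher {
  public MyIndexSearcher(IndexReader reader, IndexReader[] subReaders, int[] docStarts) {
    super(reader, subReaders, docStarts); // proposed ctor, does not exist today
  }
}

// use 1: force top-level treatment -- one "sub reader" spanning the whole index
IndexReader top = IndexReader.open(dir, true);
IndexSearcher topLevel = new MyIndexSearcher(top, new IndexReader[] { top }, new int[] { 0 });

// use 2: a short-lived searcher borrowing another searcher's already-gathered
// subReaders/docStarts, skipping the gather step entirely
{code}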
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746839#action_12746839 ] Tim Smith commented on LUCENE-1821: --- bq. Have you done any benching here? I think we actually found that even most sorting cases were faster than in 2.4.1. I haven't done any benchmarking. I'm not arguing that 2.9 string sorting is slower than 2.4 string sorting, it may well be faster for every case. per segment searching and other improvements potentially added more gains in performance than the new string sorting added losses in performance. But, i can say rather confidently, that a large index with a bunch of segments will result in string sorting being slower when using a per segment string sort cache instead of a full index sort cache (think worst case using \*:\* query) bq. loading a field cache off a multi-segment index was dog slow this is a trade off: slower cache loading in order to get faster sorting. i plan to provide the ability to do both, and allow specific use cases to decide what is best for them Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-1821.patch Now that searching is done on a per segment basis, there is no way for a Scorer to know the actual doc id for the document's it matches (only the relative doc offset into the segment) If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer because the scorer is not passed the needed offset to calculate the real docid suggest having Weight.scorer() method also take a integer for the doc offset Abstract Weight class should have a constructor that takes this offset as well as a method to get the offset All Weights that have sub weights must pass this offset down to created sub weights Details on workaround: In order to work around this, you must do the following: * Subclass IndexSearcher * Add int getIndexReaderBase(IndexReader) method to your subclass * during Weight creation, the Weight must hold onto a reference to the passed in Searcher (casted to your sub class) * during Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader) * Scorer can now rebase any collected docids using this offset Example implementation of getIndexReaderBase(): {code} // NOTE: more efficient implementation can be done if you cache the result of gatherSubReaders in your constructor public int getIndexReaderBase(IndexReader reader) { if (reader == getReader()) { return 0; } else { List readers = new ArrayList(); gatherSubReaders(readers); Iterator iter = readers.iterator(); int maxDoc = 0; while (iter.hasNext()) { IndexReader r = (IndexReader)iter.next(); if (r == reader) { return maxDoc; } maxDoc += r.maxDoc(); } } return -1; // reader not in searcher } {code} Notes: * This workaround makes it so you cannot serialize your custom Weight implementation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746842#action_12746842 ] Tim Smith commented on LUCENE-1821:
---
I allow caches to be loaded at commit time (if configured), and recommend that frequently used caches be configured to load at that time. This can result in slower commit times, but queries are responsive as soon as the commit is finished.

Once I also add the option of per-segment caching for sorting and faceting (I'll probably turn it on by default for sorting, for faceting maybe not), this will allow full tunability for the end user.
[jira] Created: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
Add OutOfOrderCollector and InOrderCollector subclasses of Collector
Key: LUCENE-1849
URL: https://issues.apache.org/jira/browse/LUCENE-1849
Project: Lucene - Java
Issue Type: Wish
Components: Search
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor
Fix For: 2.9

I find myself always having to implement these methods, and I always return a constant (depending on whether the collector can handle out-of-order hits). It would be nice for these two convenience abstract classes to exist, implementing acceptsDocsOutOfOrder() as final and returning the appropriate value.
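A minimal sketch of what the two wished-for classes could look like (these do not exist in Lucene; the shape follows the description above):
{code}
public abstract class InOrderCollector extends Collector {
  // subclasses still implement collect(), setScorer(), setNextReader()
  public final boolean acceptsDocsOutOfOrder() {
    return false;
  }
}

public abstract class OutOfOrderCollector extends Collector {
  public final boolean acceptsDocsOutOfOrder() {
    return true;
  }
}
{code}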
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746907#action_12746907 ] Tim Smith commented on LUCENE-1849:
---
They would be convenience classes for people implementing their own Collectors (as I am). It's just kind of a pain (and bloats the amount of required code by about 5 lines) to always have to implement this method when it could easily be inherited from a parent class.

Just throwing this out as an idea to see if anyone else likes it (that's why I marked it as a _Wish_).
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746920#action_12746920 ] Tim Smith commented on LUCENE-1849:
---
People tend to reformat single-line functions like that to use at least 2 more lines (I think checkstyle/eclipse formatting will often mangle my compact code if someone else ever touches it). Also, you need the extra line for javadoc, so that's always 5 lines :(

I can always add these classes to my own class hierarchy (and I probably will if this doesn't get added to Lucene's search package), but I think they are generally useful to anyone implementing collectors. A typical person porting to 2.9 can switch their HitCollector to subclass InOrderCollector instead, in order to keep getting docs in order like Lucene 2.4 (see the sketch below). They then don't need to think about acceptsDocsOutOfOrder() semantics unless they really want to. Also, one less method for us application developers to implement incorrectly :)
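For example, a hedged sketch of such a port, assuming the proposed InOrderCollector existed:
{code}
// 2.4 style, collecting matching docids:
// searcher.search(query, new HitCollector() {
//   public void collect(int doc, float score) { bits.set(doc); }
// });

// 2.9 style with the proposed class; docs still arrive in order:
final OpenBitSet bits = new OpenBitSet(searcher.maxDoc());
searcher.search(query, new InOrderCollector() {
  private int docBase;
  public void setScorer(Scorer scorer) {}
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase; // this segment's offset in the full index
  }
  public void collect(int doc) {
    bits.set(docBase + doc); // rebase to a full-index docid
  }
});
{code}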
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746928#action_12746928 ] Tim Smith commented on LUCENE-1849:
---
I like the idea of this flag being private final and initialized via a Collector constructor. Collector.acceptsDocsOutOfOrder() should then be made final, though? (Otherwise each collector carries a boolean flag that may never be used, if a subclass implements acceptsDocsOutOfOrder() its own way.)
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747021#action_12747021 ] Tim Smith commented on LUCENE-1849:
---
bq. Or just make it package private? This flag is only used by oal.search.* to mate the right scorer to the collector.

Protected instead, please. Collector subclasses should be able to inspect this value if they want/need to.
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747028#action_12747028 ] Tim Smith commented on LUCENE-1849:
---
will do
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747039#action_12747039 ] Tim Smith commented on LUCENE-1849:
---
bq. I think this will get pretty messy and complicated.

Yeah, this gets a bit messy with the chain of inheritance in these classes (as each variant is slightly optimized depending on in-order/out-of-order collection). Makes me go back to favoring the InOrderCollector/OutOfOrderCollector abstract classes, or maybe just one AbstractCollector class that implements all methods except collect(), like so:

{code}
public abstract class AbstractCollector extends Collector {
  private final boolean allowDocsOutOfOrder;
  protected IndexReader reader;
  protected Scorer scorer;
  protected int docBase;

  public AbstractCollector() {
    this(false);
  }

  public AbstractCollector(boolean allowDocsOutOfOrder) {
    this.allowDocsOutOfOrder = allowDocsOutOfOrder;
  }

  public void setNextReader(IndexReader reader, int docBase) {
    this.reader = reader;
    this.docBase = docBase;
  }

  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  public final boolean acceptsDocsOutOfOrder() {
    return allowDocsOutOfOrder;
  }
}
{code}

bq. What exactly are we trying to solve here?

The Collector methodology has grown more complicated (because it does more to handle per-segment searches). The HitCollector API was nice and simple; this AbstractCollector (insert better name here) gets things back to being simple. It could even hide the Scorer as private and provide a score() method that returns the score for the current document, simplifying this even more. Subclassing AbstractCollector instead of Collector means most of the required common setup is done for you; otherwise, every single Collector ends up doing virtually the same thing AbstractCollector does here (as far as setup etc.).

Again, this is just a _Wish_ I've thought of while working through the new Collector API (having found myself doing the exact same thing for every implementation of Collector).
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747046#action_12747046 ] Tim Smith commented on LUCENE-1849:
---
bq. we force them to think a little bit and then do what's best for them

The more you force people to think, the more likely they will come to the wrong solution (in my experience). I love the power of the new Collector API, and I know how to take advantage of it to eke out the utmost performance where it matters or is possible. But in some cases I just want that AbstractCollector, because it reduces my code complexity for subclasses and does everything I need without me introducing duplicated code.

Also, AbstractCollector makes it much easier to create anonymous subclasses of Collector, with just one method to override (see the sketch below). I hate anonymous subclasses myself, but I see them used a lot inside Lucene; I know in 2.4 there were tons of anonymous HitCollectors.
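For instance, with the AbstractCollector sketched in the previous comment, an anonymous collector collapses to a single method (hedged example; AbstractCollector is the hypothetical class above):
{code}
final OpenBitSet hits = new OpenBitSet(searcher.maxDoc());
searcher.search(query, new AbstractCollector() {
  public void collect(int doc) {
    // docBase is maintained by AbstractCollector.setNextReader()
    hits.set(docBase + doc);
  }
});
{code}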
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747051#action_12747051 ] Tim Smith commented on LUCENE-1849:
---
I was just proposing AbstractCollector to consolidate the variations of abstract subclasses of Collector.

I like ScoringCollector; I would also like a NonScoringCollector. In that case, I would recommend both take the allowDocsOutOfOrder flag in their constructors (and store it in a private final field returned by acceptsDocsOutOfOrder()). Otherwise, I would still want to see 2 variations on each of ScoringCollector and NonScoringCollector to handle the out-of-order vs in-order cases.
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747059#action_12747059 ] Tim Smith commented on LUCENE-1849:
---
I guess the question is: which variations do we provide helper Collector implementations for? It seems like there are a bunch of possibilities (depending on how far you go). That's why I initially proposed AbstractCollector, which stores everything that is set (IndexReader, Scorer, docBase). The amount of memory and time used to set 2 pointers and an int per segment seems almost irrelevant for a Collector implementation aid (and if you really care about those few bytes and CPU cycles, you can directly implement Collector).
[jira] Commented: (LUCENE-1849) Add OutOfOrderCollector and InOrderCollector subclasses of Collector
[ https://issues.apache.org/jira/browse/LUCENE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747075#action_12747075 ] Tim Smith commented on LUCENE-1849:
---
bq. I think we should simply do nothing. This is an expert API.

I'm OK with that. I just thought this idea would potentially be of general use to other developers, but it probably gets more complex adding all the variations of subclasses of Collector, and maybe even more confusing than just the raw Collector API.
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746600#action_12746600 ] Tim Smith commented on LUCENE-1821:
---
Well, you could go a route similar to the 2.4 TokenStream API (next() vs next(Token)): have Filter.getDocIdSet(IndexSearcher, IndexReader) call Filter.getDocIdSet(IndexReader) by default, and vice versa, with one method or the other required to be overridden (see the sketch below). getDocIdSet(IndexReader) would be deprecated and removed in 3.0. Since the deprecated method would be removed in 3.0, and since no one would likely be depending on these new semantics right away, this should work.

Also, in general, QueryWrapperFilter performs a bit worse in 2.9. This is because it creates an IndexSearcher for every query it wraps, which means doing gatherSubReaders and creating the offsets anew each time getDocIdSet(IndexReader) is called. So the new method, with the IndexSearcher also passed in, is much better for evaluating these Filters.
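A minimal sketch of that mutual delegation (this is the proposal, not the actual Lucene API; like the old next()/next(Token) pattern, a subclass must override at least one of the pair or the calls recurse forever):
{code}
public abstract class Filter implements Serializable {
  /** @deprecated override getDocIdSet(IndexSearcher, IndexReader) instead; removed in 3.0 */
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    return getDocIdSet(null, reader); // delegate to the new method
  }

  /** New method: also receives the leaf IndexSearcher for full-index context. */
  public DocIdSet getDocIdSet(IndexSearcher searcher, IndexReader reader) throws IOException {
    return getDocIdSet(reader); // delegate to the old method for back compat
  }
}
{code}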
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746613#action_12746613 ] Tim Smith commented on LUCENE-1821:
---
bq. thats a tough bunch of code to decide to spread ...

At least it'll be able to go away real soon, with 3.0.
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746643#action_12746643 ] Tim Smith commented on LUCENE-1821:
---
Lots of new comments to respond to :) Will try to cover them all.

bq. decent comparator (StringOrdValComparator) that operates per segment.

Still, the StringOrdValComparator has to break down and call String.equals() whenever it compares docs in different IndexReaders. It also has to do more maintenance in general than a plain string-ord comparator with a cache across all IndexReaders would need. While the StringOrdValComparator may be faster in 2.9 than string sorting was in 2.4, it's not as fast as it could be if the cache were created at the IndexSearcher level. I looked at the new string sorting stuff last week, and it looks pretty smart about reducing the number of String.equals() calls needed, but it adds extra complexity and still reduces to String.equals() calls, which translates to slower sorting than could be possible.

bq. one option might be to subclass DirectoryReader

The idea being to disable per-segment searching? I don't actually want to do that. I want to use per-segment searching to take advantage of caches on a per-segment basis where possible, and map docs to the IndexSearcher context when I can't do per-segment caching.

bq. Could you compute the top-level ords, but then break it up per-segment?

I think I see what you're getting at here, and I've already thought of this as a potential solution. The cache will always need to be created at the top-most level, but it will be pre-broken out into a per-segment cache whose context is the top-level IndexSearcher/MultiReader. The biggest problem here is the complexity of actually creating such a cache, which I'm sure will translate into slower cache loading (hard to say how much slower without implementing it). I do plan to try this approach, but I expect it is at least a week or two out.

I've currently updated my code to work per-segment by adding the docBase when performing the lookup into this cache (which is per-IndexSearcher). I did this using the getIndexReaderBase() function I added to my subclass of IndexSearcher, at Scorer construction time. I can live with this; however, I would like to see getIndexReaderBase() added to IndexSearcher, and the IndexSearcher passed to Weight.scorer(), so I don't need to hold onto my IndexSearcher subclass in my Weight implementation.

bq. just return the virtual per-segment DocIdSet.

That's what I'm doing now (sketched below). I use the docid base for the IndexReader, along with its maxDoc, to have the Scorer represent a virtual slice for just the segment in question. The only real problem here is that during Scorer initialization I have to call fullDocIdSetIter.advance(docBase) in the Scorer constructor. If advance(int) for the DocIdSet in question is O(N), this adds an extra penalty per segment that did not exist before.

bq. this isn't a long-term solution, since the order in which Lucene visits the readers isn't in general guaranteed

That's where IndexSearcher.getIndexReaderBase(IndexReader) comes into play. If you call this in your scorer to get the docBase, it doesn't matter what order the segments are searched in (it will always return the proper base, in the context of that IndexSearcher).

Here's another potential thought (very rough, haven't consulted code to see how feasible this is): what if Similarity had a method called getDocIdBase(IndexReader)? Then the searcher implementation could wrap the provided Similarity to provide the proper calculation. Similarity is already passed through the chain of Weight creation and is passed into the Scorer. Obviously, a Query implementation can completely drop the passing of the Searcher's similarity and drop in its own (but that would mean it doesn't care about getting these docid bases). I think this approach would potentially resolve all MultiSearcher difficulties.
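For reference, a hedged sketch of that "virtual per-segment slice" (hypothetical scorer internals; note the advance(docBase) cost paid once per segment):
{code}
// inside a per-segment scorer wrapping a full-index DocIdSet
DocIdSetIterator full = fullIndexDocIdSet.iterator();
int doc = full.advance(docBase); // the O(advance) penalty paid per segment
while (doc != DocIdSetIterator.NO_MORE_DOCS && doc < docBase + reader.maxDoc()) {
  int segmentDoc = doc - docBase; // rebase to a segment-relative docid
  // ... expose segmentDoc via the scorer's nextDoc()/docID() ...
  doc = full.nextDoc();
}
{code}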
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746662#action_12746662 ] Tim Smith commented on LUCENE-1821:
---
Can I at least argue for it being tagged for 3.0 or 3.1 (just so it gets looked at again prior to the next releases)? I have workarounds for 2.9, so I'm OK with it not getting in then (I just want to make sure my use cases won't be made impossible in future releases).
[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746450#action_12746450 ] Tim Smith commented on LUCENE-1842:
---
Here's some pseudo code to hopefully fully show this use case:

{code}
// These guys are initialized once
Analyzer analyzer1 = new SimpleAnalyzer();
Analyzer analyzer2 = new StandardAnalyzer();
Analyzer analyzer3 = new LowerCaseAnalyzer();

// This is done on a per-Field basis
Reader source1 = new StringReader("some text");
Reader source2 = new StringReader("some more text");
Reader source3 = new StringReader("final text");

// ("f" is the field name, required by the real reusableTokenStream() signature)
TokenStream stream1 = analyzer1.reusableTokenStream("f", source1);
TokenStream stream2 = analyzer2.reusableTokenStream("f", source2);
TokenStream stream3 = analyzer3.reusableTokenStream("f", source3);

// Create the container for the shared attributes map
AttributeSource attrs = new AttributeSource();

// Have all streams share the same attributes map
stream1.reset(attrs);
stream2.reset(attrs);
stream3.reset(attrs);

// Create my merging TokenStream (have it use attrs as its attribute source)
TokenStream merger = new MergeTokenStreams(attrs, new TokenStream[] { stream1, stream2, stream3 });

// Add a filter that will put a token prior to the source token stream,
// and after the source token stream is exhausted
TokenStream finalStream = new WrapFilter(merger, "anchor token");

// finalStream will now be passed to the indexer
{code}

Hopefully this makes the use case more clear. In order to use reusableTokenStream from the Analyzers, the MergeTokenStreams must be able to share its attributes map with the underlying TokenStreams it is merging. Otherwise, MergeTokenStreams has to do something like this in its incrementToken():

{code}
public boolean incrementToken() throws IOException {
  while (currentStream != null) {
    if (currentStream.incrementToken()) {
      // copy currentStream's TermAttribute into my local termAttr
      // copy currentStream's OffsetAttribute into my local offsetAttr
      return true;
    }
    currentStream = nextStream(); // advance to the next stream in line
  }
  return false;
}
{code}

as opposed to:

{code}
public boolean incrementToken() throws IOException {
  while (currentStream != null) {
    if (currentStream.incrementToken()) {
      // nothing to copy: the underlying streams share my attributes map
      return true;
    }
    currentStream = nextStream(); // advance to the next stream in line
  }
  return false;
}
{code}

Hopefully this makes my use case clear.

Add reset(AttributeSource) method to AttributeSource
Key: LUCENE-1842
URL: https://issues.apache.org/jira/browse/LUCENE-1842
Project: Lucene - Java
Issue Type: Wish
Components: Analysis
Reporter: Tim Smith
Priority: Minor

Originally proposed in LUCENE-1826. Proposing the addition of the following method to AttributeSource:

{code}
public void reset(AttributeSource input) {
  if (input == null) {
    throw new IllegalArgumentException("input AttributeSource must not be null");
  }
  this.attributes = input.attributes;
  this.attributeImpls = input.attributeImpls;
  this.factory = input.factory;
}
{code}

Impacts:
* requires all TokenStreams/TokenFilters/etc. to call addAttribute() in their reset() method, not in their constructor
* requires making AttributeSource.attributes and AttributeSource.attributeImpls non-final

Advantages: allows creating only a single actual AttributeSource per thread, which can then be used for indexing with a multitude of TokenStream/Tokenizer combinations (allowing utmost reuse of TokenStream/Tokenizer instances). This results in only a single attributes/attributeImpls map being required per thread, and addAttribute() calls will almost always return right away (initialized only once per thread).
[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746452#action_12746452 ] Tim Smith commented on LUCENE-1842:
---
Yes, I know that fully creating the Tokenizer/TokenStream each time will do the trick as well, but I was hoping for some way to take advantage of the reusableTokenStream concept (especially for Tokenizers that take a long time to construct (load resources/etc.)).

What I guess I really want is this method added to Analyzer:

{code}
public TokenStream tokenStream(AttributeSource attrs, Reader reader);
{code}

But I assume this would either have to reconstruct the full TokenStream chain every time (could be costly), or it would require the AttributeSource.reset(AttributeSource) method in order to reuse saved streams.
[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746455#action_12746455 ] Tim Smith commented on LUCENE-1842:
---
The problem with the MergeAnalyzer is that it requires multiple Readers as input. But I think the idea does put me on another (potentially better) track for sharing the same underlying AttributeSource across all the merged token streams (as well as sharing reusable TokenStreams). I'll try to put this to the test on Monday when I get back to work.
[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746457#action_12746457 ] Tim Smith commented on LUCENE-1842:
---
I would never use the merging TokenStream when doing highlighting anyway. Also, I'm sure I can get the merging TokenStream to update the offsets appropriately (based on the merge); I never use offsets for anything right now anyway (although I may in the future). And I can't let the indexer do the merging, because I want to add additional analytics on top of the merge (which can't be done on the sub-streams in piecemeal fashion).

Also, merging may not be a straight concatenation; more complex merges may merge sorted streams into a final sorted token stream, interleave tokens from sub-streams in round-robin fashion, and so on. (The only use I have for it right now is straight concatenation, but the concept could be applied to nastier stuff.)
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745941#action_12745941 ] Tim Smith commented on LUCENE-1821:
---
I'm OK with having to jump through some hoops in order to get back to the full index context. It would be nice if this were better facilitated by Lucene's API (IMO, this would be best handled by adding a Searcher as the first arg to Weight.scorer(), as then a Weight would not need to hold onto it, breaking Serializable).

There are definitely plenty of use cases that take advantage of the whole index (the one created by IndexWriter), so this ability should not be removed. I have at least 3 in my application alone (and they are all very important).

You get trade-offs working per-segment vs per-MultiReader when it comes to caching. In general, going per-segment means caches load faster, and load less frequently, but it makes algorithms working with the caches slower (depending on the algorithm and cache type):
* for static boosting from a field value (ValueSource), it makes no difference
* for numeric sorting, it makes no difference
* for string sorting, it makes a big difference - you now have to do a bunch of String.equals() calls, where you didn't have to in 2.4 (which just used the ord index)

Given this, you should really be able to do string sorting 2 ways:
* using a per-segment field cache (commit time/first query faster, sort time slower)
* using a multi-reader field cache (commit time/first query slower, sort time faster)

The same argument also applies to features like faceting (not provided by Lucene, but provided by applications like Solr, and my application). Using a per-segment cache will cause significant performance loss when faceting, as it requires creating the facets for each segment and then merging them, sketched below (this results in a good deal of extra object overhead/memory overhead/more work that faceting on the multi-reader does not see).

In the end, it should be up to the application developer to choose what strategy works best for them and their application (fast commits/fast cache loading may take a back seat to fast query execution). In general, I find there is a trade-off between commit time and query time: the more you speed up commit time, the slower query time gets, and vice versa. I just want/need the ability to choose.
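To illustrate the per-query merging overhead mentioned above, a sketch (hypothetical structures, not Solr or Lucene code) of combining per-segment facet counts:
{code}
// per-segment counts must be merged into one result for every query;
// a multi-reader cache would produce 'merged' directly
Map merged = new HashMap(); // String term -> Integer count
for (Iterator it = perSegmentCounts.iterator(); it.hasNext();) {
  Map perSeg = (Map) it.next();
  for (Iterator e = perSeg.entrySet().iterator(); e.hasNext();) {
    Map.Entry entry = (Map.Entry) e.next();
    Integer prev = (Integer) merged.get(entry.getKey());
    int add = ((Integer) entry.getValue()).intValue();
    merged.put(entry.getKey(), new Integer(prev == null ? add : prev.intValue() + add));
  }
}
{code}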
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745960#action_12745960 ] Tim Smith commented on LUCENE-1821:
---
bq. You never officially had the full index context

Officially, I didn't not have the full index context either. (It was undefined at best, but it was clear from both the Lucene code and my use of the API that I did have the full index context.) Whenever I do a search, I always explicitly know what context I'm searching in (it's always an IndexSearcher context). Further, whenever I pass an IndexReader to any method (to create a cache/etc.), I explicitly know what context I'm dealing with, so I know what the docids used there mean. As the application developer, I have full control over what I pass into the Lucene API and where, and I know the context of passing it in (javadoc should just be fully clear on how what goes in is used, if it isn't already). I always have the option not to use a utility class/method provided by Lucene if it does not have the proper context semantics I need (and can write my own that does).

bq. The current API would not support this without back compat breaks up the wazoo

I kind of see what you mean here, but then how is it OK to pass an IndexReader to this method? By the same right, it seems like it should be OK to pass the IndexSearcher (the direct context for the IndexReader in question) to Weight.scorer(), if it's OK to pass the IndexReader. (The scorer() method's interface was already changed between 2.4 and 2.9, adding allowDocsInOrder and topScorer.)

bq. You can pick, but we have to be true to the API or change it (not easy with our back compat policies)

To be fair, 2.9 has a lot of back compat breaks, both in API and in runtime behavior. (I had tons of compile errors when I dropped 2.9 in, as well as some other hacks I had to add, at least temporarily, in order to get 2.9 to work due to runtime changes, primarily this per-segment search stuff.) I have no problem with back compat breaks in general (it only took me about a day to absorb 2.9 initially, and I'm still working on fully taking advantage of new features and getting rid of deprecated class use). The only requirement I would put on a back compat break is that it have a workaround to get back the previous version's behavior (in this case, making it possible to remap the docids to the IndexSearcher context inside the scorer).
[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
[ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745969#action_12745969 ] Tim Smith commented on LUCENE-1826:
---
bq. This is not possible per design. The AttributeSource cannot be changed.

I fully understand why, but... it should be rather easy to add a reset(AttributeSource input) to AttributeSource:

{code}
public void reset(AttributeSource input) {
  if (input == null) {
    throw new IllegalArgumentException("input AttributeSource must not be null");
  }
  this.attributes = input.attributes;
  this.attributeImpls = input.attributeImpls;
  this.factory = input.factory;
}
{code}

This would require making attributes and attributeImpls non-final (potentially reducing some JVM caching capabilities). However, it then provides the ability to do even more Attribute reuse. For example, if this method existed, the indexer could use a ThreadLocal of raw AttributeSources (one AttributeSource per thread); then, prior to calling TokenStream.reset(), it could call TokenStream.reset(theThreadLocalAttributeSource), as sketched below. This would result in all token streams for the same document using the same AttributeSource (reusing TermAttribute, etc.). It would require that no TokenStreams/Filters/Tokenizers call addAttribute() in the constructor (they would have to do this in reset()).

I totally get that this is a tall order. If you want, I can open a separate ticket for this (AttributeSource.reset(AttributeSource)) for further consideration.

All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
-
Key: LUCENE-1826
URL: https://issues.apache.org/jira/browse/LUCENE-1826
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Assignee: Michael Busch
Fix For: 2.9

I have a TokenStream implementation that joins together multiple sub TokenStreams (I then do additional filtering on top of this, so I can't just have the indexer do the merging). In 2.4 this worked fine: once one sub-stream was exhausted, I just started using the next stream. In 2.9, however, this is very difficult, and requires copying term buffers for every token being aggregated. But if all the sub TokenStreams share the same AttributeSource, and my concat TokenStream shares the same AttributeSource, this goes back to being very simple (and very efficient).

So, for example, I would like to see the following constructor added to StandardTokenizer:

{code}
public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
  super(source);
  ...
}
{code}

I would likewise want similar constructors added to all Tokenizer subclasses provided by Lucene.
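A hedged sketch of that per-thread reuse, assuming the proposed reset(AttributeSource) existed (it is not part of the actual AttributeSource API):
{code}
// one raw AttributeSource per indexing thread
private static final ThreadLocal PER_THREAD_ATTRS = new ThreadLocal() {
  protected Object initialValue() {
    return new AttributeSource();
  }
};

void index(TokenStream stream) throws IOException {
  AttributeSource shared = (AttributeSource) PER_THREAD_ATTRS.get();
  stream.reset(shared); // proposed method: adopt the shared attributes map
  stream.reset();       // normal stream reset
  while (stream.incrementToken()) {
    // TermAttribute etc. live in 'shared', reused by every stream on this thread
  }
}
{code}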
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745979#action_12745979 ] Tim Smith commented on LUCENE-1821: ---

NOTE: if the leaf IndexSearcher were to be passed to scorer(), it would also have to be passed to explain()
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745977#action_12745977 ] Tim Smith commented on LUCENE-1821: ---

{quote}
It was an implementation detail. If you look at MultiSearcher, Searchable, Searcher and how the API is put together, you can see we don't support that type of thing. I think its fairly clear after a little thought. You can limit your API's to handle just IndexSearchers, but as a project, we cannot.
{quote}

I totally understand your resistance here. I get that i'm really utilizing advanced lucene concepts at very low levels (and these are subject to some changes that i will have to absorb with new versions).

bq. Its okay to pass the Reader because its a contextless Reader. There is no value in also passing a contextless Searcher

well, when you pass the Searcher that contains the Reader, the Reader is no longer contextless. also, the context of the Searcher can be fairly well defined: it's a leaf Searcher, the one that actually called Weight.scorer().

Also, looking a bit more at MultiSearcher semantics, sorting already requires this leaf Searcher context in order to work. MultiSearcher just takes the top docs from each underlying Searchable, adjusts the docids to the MultiSearcher context, and sends them through another priority queue (sketched below). So, this leaf Searcher context concept is required by sorting already; I just want my Scorer to be given this leaf context as well.

Also, since it is a leaf context, the Weight.scorer() method could have the following interface:
{code}
/**
 * @param searcher The IndexSearcher that contains reader.
 */
public Scorer scorer(IndexSearcher searcher, IndexReader reader,
                     boolean allowDocsInOrder, boolean topScorer);
{code}
then, with the patch i posted, i could call searcher.getIndexReaderBase(reader) and i'm all set.
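The rebasing that MultiSearcher already performs looks roughly like this. This is a simplified sketch, not the real code: the actual MultiSearcher merges the rebased hits through a priority queue, and the method/array names here are illustrative.
{code}
import java.io.IOException;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;

// Simplified sketch of MultiSearcher-style docid rebasing.
public class RebaseSketch {
  // starts[i] holds the doc offset of searchables[i] within the combined index
  static void searchAndRebase(Searchable[] searchables, int[] starts,
                              Weight weight, int n) throws IOException {
    for (int i = 0; i < searchables.length; i++) {
      TopDocs docs = searchables[i].search(weight, null, n);
      for (int j = 0; j < docs.scoreDocs.length; j++) {
        // shift the sub-searcher-relative id into the MultiSearcher's doc space
        docs.scoreDocs[j].doc += starts[i];
      }
      // ...the rebased hits would then go through another priority queue
    }
  }
}
{code}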
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745988#action_12745988 ] Tim Smith commented on LUCENE-1821: ---

here's what you can do:
{code}
/** @deprecated use {@link #getDocIdSet(IndexSearcher, IndexReader)} */
public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
  return getDocIdSet(new IndexSearcher(reader), reader);
}

public DocIdSet getDocIdSet(final IndexSearcher searcher, final IndexReader reader) throws IOException {
  final Weight weight = query.weight(searcher);
  return new DocIdSet() {
    public DocIdSetIterator iterator() throws IOException {
      return weight.scorer(searcher, reader, true, false);
    }
  };
}
{code}
and yeah, i'm all for tons of warnings in javadoc explicitly defining the contracts.
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745991#action_12745991 ] Tim Smith commented on LUCENE-1821: ---

what class is this getDocIdSet method on? (lacking the context of where it's used)
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746004#action_12746004 ] Tim Smith commented on LUCENE-1821: ---

Looks like Filter should have another method added: getDocIdSet(IndexSearcher searcher, IndexReader reader), deprecating getDocIdSet(IndexReader).

The new method would call the old method by default (with little harm done in general). IndexSearcher would call the new getDocIdSet() variant. QueryWrapperFilter would be updated to implement getDocIdSet(IndexSearcher, IndexReader) (with the old method wrapping the IndexReader with an IndexSearcher). This would actually be cleaner for QueryWrapperFilter, as it wouldn't have to create a new IndexSearcher on every call.

i definitely see that this is potentially more painful than the changes to the scorer() method (question is: how many people implement custom Filters?). Personally, i don't use Filter, so any changes here don't impact me, but to the best of my knowledge, i'm not the only one using lucene :)
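Sketched out, the proposed change might look like the following. This is a hypothetical class (ProposedFilter is not actual Lucene API); the point is that the default implementation of the new overload keeps existing custom Filters working unchanged.
{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.IndexSearcher;

// Hypothetical sketch of the proposed two-method Filter.
public abstract class ProposedFilter {
  /** @deprecated use {@link #getDocIdSet(IndexSearcher, IndexReader)} */
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    throw new UnsupportedOperationException("implement one of the two overloads");
  }

  /** New variant; by default delegates to the old method, so old Filters still work. */
  public DocIdSet getDocIdSet(IndexSearcher searcher, IndexReader reader) throws IOException {
    return getDocIdSet(reader);
  }
}
{code}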
[jira] Created: (LUCENE-1839) Scorer.explain is deprecated but abstract, should have impl that throws UnsupportedOperationException
Scorer.explain is deprecated but abstract, should have impl that throws UnsupportedOperationException

Key: LUCENE-1839
URL: https://issues.apache.org/jira/browse/LUCENE-1839
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Tim Smith
Priority: Minor
Fix For: 2.9

Suggest having Scorer implement explain() to throw UnsupportedOperationException. right now, i have to implement this method (because it's abstract), and javac yells at me for overriding a deprecated method. if the following implementation is in Scorer, i can remove my empty implementations of explain() from my Scorers:
{code}
/** Returns an explanation of the score for a document.
 * <br>When this method is used, the {@link #next()}, {@link #skipTo(int)} and
 * {@link #score(HitCollector)} methods should not be used.
 * @param doc The document number for the explanation.
 *
 * @deprecated Please use {@link IndexSearcher#explain}
 * or {@link Weight#explain} instead.
 */
public Explanation explain(int doc) throws IOException {
  throw new UnsupportedOperationException();
}
{code}
best i figure, this shouldn't break back compat (people already have to recompile anyway; 2.9 is definitely not binary compatible with 2.4)
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746263#action_12746263 ] Tim Smith commented on LUCENE-1821: ---

I started integrating the per-segment searching (removed my hack that was doing searching on MultiReader). In order to get my query implementations to work, i had to hold onto my Searcher in the Weight constructor and add the getIndexReaderBase() method to my IndexSearcher implementation, and this seems to be working well.

I had 3 query implementations that were affected:
* one used a cache that will be easy to create per segment (will have this use a per-segment cache as soon as i can)
* one used an int[] ord index (the underlying cache cannot be made per segment)
* one used a cached DocIdSet created over the top level MultiReader (should be able to have a DocIdSet per segment reader here, but this will take some more thinking, as the source of the matching docids is from a separate index; will also need to know which sub DocIdSet to use based on which IndexReader is passed to scorer() - shouldn't be any big deal)

i'm a bit concerned that i may not be testing multi-segment searching quite properly right now though, since i think most of my indexes being tested only have one segment. On that topic: if i create a subclass of LogByteSizeMergePolicy and return null from findMerges() and findMergesToExpungeDeletes() (a sketch of such a policy follows below), will this guarantee that segments will only be merged if i explicitly optimize? In which case, i can just pepper in some commits as i add documents to guarantee that i have more than 1 segment.

Overall, i am really liking the per-segment stuff, and the Collector API in general. it's already made it possible to optimize a good deal of things away (like calling Scorer.score() for docs that end up getting filtered away). however, i hit some deoptimization due to some of the crazy stuff i had to do to make those 3 query implementations work, but this should only really be isolated to one of the implementations (and i can hopefully reoptimize those cases anyway).

I would still like to see IndexSearcher passed to Weight.scorer(), and the getIndexReaderBase() method added to IndexSearcher. This will clean up my current hacks to map docids.
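A sketch of the merge-policy subclass floated above. This assumes the 2.9-era API where MergePolicy takes the IndexWriter in its constructor and where findMerges()/findMergesToExpungeDeletes() return a (possibly null) MergeSpecification; signatures differ in other 2.x/3.x releases, so treat this as an approximation. findMergesForOptimize() is deliberately left alone so optimize() still merges.
{code}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.SegmentInfos;

// Sketch: background merging disabled, optimize() still works.
public class NoBackgroundMergesPolicy extends LogByteSizeMergePolicy {
  public NoBackgroundMergesPolicy(IndexWriter writer) {
    super(writer);
  }

  public MergeSpecification findMerges(SegmentInfos infos) {
    return null; // null means "nothing to merge"
  }

  public MergeSpecification findMergesToExpungeDeletes(SegmentInfos infos) {
    return null;
  }
}
{code}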
[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
[ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746356#action_12746356 ] Tim Smith commented on LUCENE-1826: ---

i'll fork off another ticket for the reset(AttributeSource) method
[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
[ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746360#action_12746360 ] Tim Smith commented on LUCENE-1826: ---

forked off the reset(AttributeSource) to LUCENE-1842
[jira] Created: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource
Add reset(AttributeSource) method to AttributeSource

Key: LUCENE-1842
URL: https://issues.apache.org/jira/browse/LUCENE-1842
Project: Lucene - Java
Issue Type: Wish
Components: Analysis
Reporter: Tim Smith
Fix For: 2.9

Originally proposed in LUCENE-1826. Proposing the addition of the following method to AttributeSource:
{code}
public void reset(AttributeSource input) {
  if (input == null) {
    throw new IllegalArgumentException("input AttributeSource must not be null");
  }
  this.attributes = input.attributes;
  this.attributeImpls = input.attributeImpls;
  this.factory = input.factory;
}
{code}
Impacts:
* requires all TokenStreams/TokenFilters/etc to call addAttribute() in their reset() method, not in their constructor
* requires making AttributeSource.attributes and AttributeSource.attributeImpls non-final

Advantages:
Allows creating only a single actual AttributeSource per thread that can then be used for indexing with a multitude of TokenStream/Tokenizer combinations (allowing utmost reuse of TokenStream/Tokenizer instances). this results in only a single attributes/attributeImpls map being required per thread, and addAttribute() calls will almost always return right away (will only be initialized once per thread).
[jira] Commented: (LUCENE-1842) Add reset(AttributeSource) method to AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746380#action_12746380 ] Tim Smith commented on LUCENE-1842: ---

bq. still pay the price for filling the two hashmaps and the cache lookups.

this would only ever be incurred once per thread (if the same root AttributeSource was always used). the cache lookups would still need to be done at TokenStream.reset() time, however they would pretty much always get a hit.

the main use case this proposal supports is as follows: i have a TokenStream that merges multiple sub token streams (i call this out in LUCENE-1826). in order to do this really efficiently, all sub token streams need to share the same AttributeSource. then, the merging TokenStream can just iterate through its sub streams, calling incrementToken() to consume all tokens from each stream. without the ability to reset the sub streams' AttributeSource to the same AttributeSource used by the merging TokenStream, you have to copy the attributes from the sub streams as you iterate. (a sketch of such a merging stream follows below)

furthermore, the sub TokenStreams could potentially be any TokenStream (or chain of TokenStreams rooted with a Tokenizer). without the reset(AttributeSource) method, i would have to create the TokenStream chain anew for every merging TokenStream (or do the attribute copying approach).
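For reference, roughly what that merging stream looks like when every sub-stream shares the merging stream's AttributeSource, which is exactly what reset(AttributeSource) would make possible for arbitrary streams. This is an illustrative sketch (the class name and array-based wiring are assumptions), not a drop-in implementation.
{code}
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Sketch of a concatenating TokenStream over sub-streams that all share
// this stream's AttributeSource.
public final class ConcatTokenStream extends TokenStream {
  private final TokenStream[] subs;
  private int current = 0;

  public ConcatTokenStream(AttributeSource source, TokenStream[] subs) {
    super(source); // share one AttributeSource across the whole chain
    this.subs = subs;
  }

  public boolean incrementToken() throws IOException {
    while (current < subs.length) {
      // the sub-stream writes into the shared attributes, so there is
      // nothing to copy -- its token is already "our" token
      if (subs[current].incrementToken()) {
        return true;
      }
      current++; // exhausted; move on to the next sub-stream
    }
    return false;
  }
}
{code}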
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745423#action_12745423 ] Tim Smith commented on LUCENE-1821: ---

I can work up another patch where the Searcher is passed into Weight.scorer() as well, if that is an acceptable approach (this method was already changed a lot in 2.9 anyway)
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745436#action_12745436 ] Tim Smith commented on LUCENE-1821: ---

true, MultiSearcher does kink things up some (and the Searcher abstract class in general). personally, this is not a problem for me (don't use MultiSearcher (not yet at least)), and i'm happy with being passed the IndexSearcher instance that directly contains the IndexReader i'm being passed. The contract could be marked that the Searcher provided is the direct container of the IndexReader also passed, at which point both explain() and scorer() would be accurate in terms of this.

I would almost like to see something different passed in instead of a Searcher/IndexReader pair. i would actually like to see a SearchContext sort of object passed in (a rough sketch follows below). this would represent the whole tree of Searchers/IndexReaders. this would allow access to the MultiSearcher, the direct IndexSearcher, and the sub IndexReader (which should actually be used for the scoring), as well as any other Searchers in the call stack.

this SearchContext could also pass in the topScorer/allowDocsInOrder flags (but that would be more difficult, as scorers have subscorers that sometimes need to be created with different flags for these). but this SearchContext could be used to pass more information throughout the Scorer API in general from the top level (like: always use constant score queries where possible; use scoring algorithm X, Y, or Z; and so on). obviously this would impact the API of Searcher a good deal, as it would have to maintain this stack as sub Searchers' search() methods are called.
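A strawman of the SearchContext idea, purely illustrative: no such class exists in Lucene, and the fields here are just the pieces named in the comment above.
{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;

// Strawman only: a context object carrying the searcher/reader tree.
public class SearchContext {
  private final Searcher top;          // e.g. the MultiSearcher, if any
  private final IndexSearcher leaf;    // the searcher that called Weight.scorer()
  private final IndexReader subReader; // the segment reader actually being scored
  private final int docBase;           // offset of subReader within the whole index

  public SearchContext(Searcher top, IndexSearcher leaf,
                       IndexReader subReader, int docBase) {
    this.top = top;
    this.leaf = leaf;
    this.subReader = subReader;
    this.docBase = docBase;
  }

  public Searcher getTopSearcher() { return top; }
  public IndexSearcher getLeafSearcher() { return leaf; }
  public IndexReader getSubReader() { return subReader; }
  public int getDocBase() { return docBase; }
}
{code}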
[jira] Created: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException
AttributeSource.getAttribute() should throw better IllegalArgumentException

Key: LUCENE-1825
URL: https://issues.apache.org/jira/browse/LUCENE-1825
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

when setting "use only new API" for TokenStream, i received the following exception:
{code}
[junit] Caused by: java.lang.IllegalArgumentException: This AttributeSource does not have the attribute 'interface org.apache.lucene.analysis.tokenattributes.TermAttribute'.
[junit]   at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:249)
[junit]   at org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:252)
[junit]   at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:145)
[junit]   at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
[junit]   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
[junit]   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
[junit]   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2613)
{code}
However, i can't actually see the culprit that caused this exception. suggest that the IllegalArgumentException include getClass().getName() in order to be able to identify which TokenStream implementation actually caused this.
[jira] Commented: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException
[ https://issues.apache.org/jira/browse/LUCENE-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745502#action_12745502 ] Tim Smith commented on LUCENE-1825: ---

Looked a little closer at this, and it looks like this exception occurs if the root TokenStream does not addAttribute() all attributes expected by the indexer. I suppose if the indexer called addAttribute() instead of getAttribute(), this wouldn't happen (attributes not provided by the TokenStream, but required by the indexer, would be initialized at index time (and would remain empty)). (see the snippet below)
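The difference boils down to one call. A minimal snippet against the 2.9 API (the cast style matches the era's raw-type signatures; class and method names here are for illustration only):
{code}
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

public class AddVsGetExample {
  static void demo(AttributeSource source) {
    // addAttribute() creates the attribute if the stream didn't provide it,
    // so the indexer would see an empty-but-present TermAttribute
    TermAttribute lenient = (TermAttribute) source.addAttribute(TermAttribute.class);

    // getAttribute() throws IllegalArgumentException if it's absent --
    // the exception reported in this issue
    TermAttribute strict = (TermAttribute) source.getAttribute(TermAttribute.class);
  }
}
{code}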
[jira] Commented: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException
[ https://issues.apache.org/jira/browse/LUCENE-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745508#action_12745508 ] Tim Smith commented on LUCENE-1825: ---

Updated getAttribute() on AttributeSource as follows to find the source of my pain:
{code}
/**
 * The caller must pass in a Class<? extends Attribute> value.
 * Returns the instance of the passed in Attribute contained in this AttributeSource
 *
 * @throws IllegalArgumentException if this AttributeSource does not contain the
 *         Attribute
 */
public AttributeImpl getAttribute(Class attClass) {
  AttributeImpl att = (AttributeImpl) this.attributes.get(attClass);
  if (att == null) {
    throw new IllegalArgumentException(getClass().getName()
        + " does not have the attribute '" + attClass + "'.");
  }
  return att;
}
{code}
I see that this could end up being an arbitrary org.apache.lucene.util.AttributeSource though, if you aren't fully integrating the new api.
[jira] Created: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource
All Tokenizer implementations should have constructor that takes an AttributeSource

Key: LUCENE-1826
URL: https://issues.apache.org/jira/browse/LUCENE-1826
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith

I have a TokenStream implementation that joins together multiple sub TokenStreams (i then do additional filtering on top of this, so i can't just have the indexer do the merging). in 2.4, this worked fine: once one sub stream was exhausted, i just started using the next stream. however, in 2.9, this is very difficult, and requires copying Term buffers for every token being aggregated. however, if all the sub TokenStreams share the same AttributeSource, and my concat TokenStream shares the same AttributeSource, this goes back to being very simple (and very efficient).

So for example, i would like to see the following constructor added to StandardTokenizer:
{code}
public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
  super(source);
  ...
}
{code}
would likewise want similar constructors added to all Tokenizer sub classes provided by lucene.
[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745523#action_12745523 ] Tim Smith commented on LUCENE-1826: ---

i'll do that from now on (feel free to boot them if you feel necessary (didn't want to overstep my bounds suggesting fix in 2.9))
[jira] Updated: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1826: -- Fix Version/s: 2.9
[jira] Updated: (LUCENE-1825) AttributeSource.getAttribute() should throw better IllegalArgumentException
[ https://issues.apache.org/jira/browse/LUCENE-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1825: -- Fix Version/s: 2.9
[jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructor that takes an AttributeSource
[ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745530#action_12745530 ] Tim Smith commented on LUCENE-1826: ---

NOTE: for me, this is just a nice-to-have. I currently only use my concat TokenStream on my own TokenStream implementations right now (so i can do this manually on my own TokenStream impls). however, i would like to be able to directly use lucene Tokenizers under my concat TokenStream in some situations in the future.