[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034640#comment-13034640 ] Earwin Burrfoot commented on LUCENE-3105: - Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm? > String.intern() calls slow down IndexWriter.close() and IndexReader.open() > for index with large number of unique field names > > > Key: LUCENE-3105 > URL: https://issues.apache.org/jira/browse/LUCENE-3105 > Project: Lucene - Java > Issue Type: Bug > Components: core/index >Affects Versions: 3.1 >Reporter: Mark Kristensson > Attachments: LUCENE-3105.patch > > > We have one index with several hundred thousand unique field names (we're > optimistic that Lucene 4.0 is flexible enough to allow us to change our index > design...) and found that opening an index writer and closing an index reader > results in horribly slow performance on that one index. I have isolated the > problem down to the calls to String.intern() that are used to allow for quick > string comparisons of field names throughout Lucene. These String.intern() > calls are unnecessary and can be replaced with a hashmap lookup. In fact, > StringHelper.java has its own hashmap implementation that it uses in > conjunction with String.intern(). Rather than using a one-off hashmap, I've > elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034639#comment-13034639 ] Earwin Burrfoot commented on LUCENE-3105: - StringInterner is in fact faster than CHM. And is compatible with String.intern(), ie - it returns the same String instances. It also won't eat up memory if spammed with numerous unique strings (which is a strange feature, but people requested that). In Lucene 4.0 all of this is moot anyway, fields there are strongly separated and intern() is not used.
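For readers following along, the ConcurrentHashMap approach the patch describes can be sketched roughly as follows. This is an illustrative sketch, not the actual patch code or Lucene's StringInterner; the class name is made up. The key point is that `putIfAbsent` preserves the canonical-instance guarantee, so field names can still be compared with `==`:

```java
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch (not the LUCENE-3105 patch itself) of replacing String.intern()
// with a ConcurrentHashMap lookup while keeping the canonical-instance guarantee.
public class SimpleStringInterner {
    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<String, String>();

    /** Returns a canonical instance of s, so callers may compare field names with ==. */
    public String intern(String s) {
        // putIfAbsent returns the previously mapped value, or null if s won the race;
        // either way every caller gets the same String instance back.
        String prev = map.putIfAbsent(s, s);
        return prev == null ? s : prev;
    }
}
```

Unlike `String.intern()`, this map grows without bound when fed unique strings, which is exactly the trade-off debated in the comments above.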
[jira] [Commented] (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033279#comment-13033279 ] Earwin Burrfoot commented on LUCENE-2793: - As mentioned @LUCENE-3092, it would be nice not to include the OneMerge, but some meaningful value like 'expectedSize', 'expectedSegmentSize' or whatnot, that would work both for merges *and* flushes, and also won't introduce a needless dependency on MergePolicy. > Directory createOutput and openInput should take an IOContext > - > > Key: LUCENE-2793 > URL: https://issues.apache.org/jira/browse/LUCENE-2793 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless >Assignee: Simon Willnauer > Labels: gsoc2011, lucene-gsoc-11, mentor > Attachments: LUCENE-2793.patch > > > Today for merging we pass down a larger readBufferSize than for searching > because we get better performance. > I think we should generalize this to a class (IOContext), which would hold > the buffer size, but then could hold other flags like DIRECT (bypass OS's > buffer cache), SEQUENTIAL, etc. > Then, we can make the DirectIOLinuxDirectory fully usable because we would > only use DIRECT/SEQUENTIAL during merging. > This will require fixing how IW pools readers, so that a reader opened for > merging is not then used for searching, and vice/versa. Really, it's only > all the open file handles that need to be different -- we could in theory > share del docs, norms, etc, if that were somehow possible.
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032997#comment-13032997 ] Earwin Burrfoot commented on LUCENE-3092: - bq. The IOCtx should reference the OneMerge (if in fact this file is being opened because of a merge)? IOCtx should have a value 'expectedSize', or 'priority', or something similar. This does not introduce a transitive dependency of Directory on MergePolicy (to please you once more - a true WTF), and it allows applying the same logic to flushes. Eg - all small flushes/merges go to cache, all big flushes/merges go straight to disk. > NRTCachingDirectory, to buffer small segments in a RAMDir > - > > Key: LUCENE-3092 > URL: https://issues.apache.org/jira/browse/LUCENE-3092 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch > > > I created this simple Directory impl, whose goal is to reduce IO > contention in a frequent-reopen NRT use case. > The idea is, when reopening quickly, but not indexing that much > content, you wind up with many small files created over time, that can > possibly stress the IO system eg if merges and searching are also > fighting for IO. > So, NRTCachingDirectory puts these newly created files into a RAMDir, > and only when they are merged into a too-large segment, does it then > write-through to the real (delegate) directory. > This lets you spend some RAM to reduce IO.
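The 'expectedSize instead of OneMerge' idea might look something like this. This is a hypothetical API sketch, not the IOContext that was eventually committed; all names here are illustrative:

```java
// Hypothetical sketch of an IO context carrying an expected size instead of a
// MergePolicy.OneMerge reference. Class, enum, and field names are made up.
public class SizeHintContext {
    public enum Source { FLUSH, MERGE, READ }

    public final Source source;
    public final long expectedSize; // bytes the caller expects to write

    public SizeHintContext(Source source, long expectedSize) {
        this.source = source;
        this.expectedSize = expectedSize;
    }

    // A caching directory can then route small flushes AND small merges to RAM
    // with a single rule, without ever knowing about MergePolicy.
    public boolean fitsInCache(long maxCachedBytes) {
        return expectedSize <= maxCachedBytes;
    }
}
```

With this shape, the "small goes to cache, big goes to disk" decision in the comment above is one comparison, identical for flushes and merges.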
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032989#comment-13032989 ] Earwin Burrfoot commented on LUCENE-3092: - bq. but I couldn't disagree more that this is an issue with an Event model There are no issues with the event model itself. It's just that this model is badly suited to this issue's use case. Event listeners are good. Using them to emulate what is essentially a mutex is ugly and fragile as hell. bq. We have a series of components in Lucene; Directories, IndexWriter, MergeScheduler etc, and we have some crosscutting concerns such as merges themselves. My point is that many concerns shouldn't necessarily be crosscutting. Eg - Directory can support IO priorities/throttling, so it doesn't have to know about merges or flushes. Many OSes have special APIs that allow IO prioritization; do they know about merges, or Lucene at all? No.
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032936#comment-13032936 ] Earwin Burrfoot commented on LUCENE-3092: - Chris, I don't like the idea of expanding IOContext again and again, but this case seems in line with its intended purpose - give the Directory implementation hints as to what we're going to do with it. I don't like events either. They look fragile, and binding them to threads is a WTF. With all our pausing/unpausing magic there's no guarantee a merge will end on the same thread it started on. bq. Stuff like FlushPolicy could take information about concurrent merges and hold off flushes for a little while if memory allows it etc. Coordinating access to a shared resource (the IO subsystem) with events is very awkward. Ok, your FlushPolicy receives events from MergePolicy and holds flushes during a merge. _Now, when a flush is in progress, should FlushPolicy notify MergePolicy so it can hold its merges?_ It goes downhill from there. What if FP and MP fire events simultaneously? :) What should other listeners do? Try looking at the bigger picture. Merges are not your problem. Neither are flushes. Your problem is that several threads try to take their dump on disk simultaneously (for whatever reason, you don't really care). So what we need is an arbitration mechanism for Directory writes. A mechanism located presumably @ Directory level (eg, we don't need to throttle anything when writing to a RAMDir). One possible implementation is that we add a constructor parameter to FSDirectory specifying the desired level of IO parallelism, and then it keeps track of its IndexOutputs and stalls writes selectively. We can also add 'expectedWriteSize' to IOContext, so the Directory may favor shorter writes over bigger ones. Instead of 'expectedWriteSize' we can use 'priority'.
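The directory-level write arbitration suggested above can be sketched with a plain permit pool. This is a hypothetical helper, not an existing Lucene class; in the sketch, the directory would own the throttle and each IndexOutput would bracket its large writes with acquire/release:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of directory-level write arbitration: a fixed permit
// pool caps how many threads may dump to disk at once, regardless of whether
// the writer is a flush or a merge. Names are illustrative.
public class WriteThrottle {
    private final Semaphore permits;

    public WriteThrottle(int ioParallelism) {
        // fair ordering, so long-queued writers (e.g. merges) aren't starved
        this.permits = new Semaphore(ioParallelism, true);
    }

    public void beginWrite() throws InterruptedException {
        permits.acquire(); // blocks when ioParallelism writers are already active
    }

    public void endWrite() {
        permits.release();
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

An 'expectedWriteSize' or 'priority' hint could then be layered on top, e.g. by letting small writes bypass the throttle entirely.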
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032841#comment-13032841 ] Earwin Burrfoot commented on LUCENE-3092: - *highfive Uwe* was going to suggest the very same thing. IOContext can include expected size. NRTCD will do its magic, other dirs may prealloc.
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032099#comment-13032099 ] Earwin Burrfoot commented on LUCENE-3084: - bq. Merges are ordered Hmm.. Why should they be? bq. SegmentInfos itself must be list It may contain a list as a field instead. And have a much cleaner API as a consequence. On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code? > MergePolicy.OneMerge.segments should be List not SegmentInfos > -- > > Key: LUCENE-3084 > URL: https://issues.apache.org/jira/browse/LUCENE-3084 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084.patch > > > SegmentInfos carries a bunch of fields beyond the list of SI, but for merging > purposes these fields are unused. > We should cutover to List instead.
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032046#comment-13032046 ] Earwin Burrfoot commented on LUCENE-3084: - * Speaking logically, merges operate on Sets of SIs, not List? * Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need.
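The composition-over-inheritance shape proposed here (a container holding a List as a field instead of extending Vector) is simple to sketch. Illustrative only; `String` stands in for SegmentInfo, and the class name is made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of composition over inheritance: the container exposes only the
// operations its clients need, rather than inheriting all of Vector's API
// (and its per-call synchronization).
public class SegmentList {
    private final List<String> segments = new ArrayList<String>(); // String stands in for SegmentInfo

    public void add(String si) {
        segments.add(si);
    }

    /** Read-only view, so clients can iterate but not mutate the internal list. */
    public List<String> asList() {
        return Collections.unmodifiableList(segments);
    }
}
```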
[jira] [Commented] (LUCENE-3077) DWPT doesn't see changes to DW#infoStream
[ https://issues.apache.org/jira/browse/LUCENE-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029881#comment-13029881 ] Earwin Burrfoot commented on LUCENE-3077: - We should just make it final everywhere ... > DWPT doesn't see changes to DW#infoStream > - > > Key: LUCENE-3077 > URL: https://issues.apache.org/jira/browse/LUCENE-3077 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 4.0 >Reporter: Simon Willnauer >Priority: Minor > Fix For: 4.0 > > > DW does not push infostream changes to DWPT since DWPT#infoStream is final > and initialized on DWPTPool initialization (at least for initial DWPT) we > should push changes to infostream to DWPT too
[jira] [Commented] (LUCENE-3065) NumericField should be stored in binary format in index (matching Solr's format)
[ https://issues.apache.org/jira/browse/LUCENE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029421#comment-13029421 ] Earwin Burrfoot commented on LUCENE-3065: - It's sad NumericFields are hardbaked into the index format. Eg - I have some fields that are similar to Numeric in that they are 'stringified' binary structures, and they can't become first-class in the same manner as Numeric. > NumericField should be stored in binary format in index (matching Solr's > format) > > > Key: LUCENE-3065 > URL: https://issues.apache.org/jira/browse/LUCENE-3065 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, > LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch > > > (Spinoff of LUCENE-3001) > Today when writing stored fields we don't record that the field was a > NumericField, and so at IndexReader time you get back an "ordinary" Field and > your number has turned into a string. See > https://issues.apache.org/jira/browse/LUCENE-1701?focusedCommentId=12721972&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12721972 > We have spare bits already in stored fields, so, we should use one to record > that the field is numeric, and then encode the numeric field in Solr's > more-compact binary format. > A nice side-effect is we fix the long standing issue that you don't get a > NumericField back when loading your document.
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029408#comment-13029408 ] Earwin Burrfoot commented on LUCENE-2904: - Ok, I'm wrong. We need both a list of all SIs and the eligible SIs for calculations. But that should be handled through an API change, not a new public method on IW. > non-contiguous LogMergePolicy should be careful to not select merges already > running > > > Key: LUCENE-2904 > URL: https://issues.apache.org/jira/browse/LUCENE-2904 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-2904.patch > > > Now that LogMP can do non-contiguous merges, the fact that it disregards > which segments are already being merged is more problematic since it could > result in it returning conflicting merges and thus failing to run multiple > merges concurrently.
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029403#comment-13029403 ] Earwin Burrfoot commented on LUCENE-2904: - I think we should simply change the API for MergePolicy. Instead of SegmentInfos it should accept a Set with SIs eligible for merging (eg, completely written & not elected for another merge). IW.getMergingSegments() is a damn cheat, and the "Expert" notice is not an excuse! :) Why should each and every MP do the set subtraction when IW can do it for them once and for all?
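The subtraction the writer would do once for every policy is one line of set arithmetic. A generic sketch, not IndexWriter code; names are illustrative:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the proposed contract: the writer computes the eligible set once
// (all segments minus those already elected for a merge) and hands only that
// set to the MergePolicy, so no policy has to do the subtraction itself.
public final class MergeEligibility {
    private MergeEligibility() {}

    public static <T> Set<T> eligible(Set<T> allSegments, Set<T> merging) {
        // LinkedHashSet preserves segment order, which some policies care about
        Set<T> result = new LinkedHashSet<T>(allSegments);
        result.removeAll(merging);
        return result;
    }
}
```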
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027626#comment-13027626 ] Earwin Burrfoot commented on LUCENE-3061: - Mark these as @experimental? > Open IndexWriter API to allow custom MergeScheduler implementation > -- > > Key: LUCENE-3061 > URL: https://issues.apache.org/jira/browse/LUCENE-3061 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3061.patch, LUCENE-3061.patch > > > IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which > makes it impossible for someone to implement his own MergeScheduler. We > should open up these API, as well as any other that can be useful for custom > MS implementations.
[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ] Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM: -- The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM? Same can be said for tests. What about throwing the original invocation exception instead of the wrapper? Since we're emulating a language feature, a simple method call, it's logical to only throw custom exceptions in .. well .. exceptional cases, like ambiguity/no matching method. If client code throws Errors/RuntimeExceptions, they should be transparently rethrown. was (Author: earwin): The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM? > Support Query Visting / Walking > --- > > Key: LUCENE-3041 > URL: https://issues.apache.org/jira/browse/LUCENE-3041 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 4.0 >Reporter: Chris Male >Assignee: Simon Willnauer >Priority: Minor > Fix For: 4.0 > > Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, > LUCENE-3041.patch, LUCENE-3041.patch > > > Out of the discussion in LUCENE-2868, it could be useful to add a generic > Query Visitor / Walker that could be used for more advanced rewriting, > optimizations or anything that requires state to be stored as each Query is > visited. > We could keep the interface very simple: > {code} > public interface QueryVisitor { > Query visit(Query query); > } > {code} > and then use a reflection based visitor like Earwin suggested, which would > allow implementators to provide visit methods for just Querys that they are > interested in.
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ] Earwin Burrfoot commented on LUCENE-3041: - The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?
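A minimal version of "JDK reflection plus CHM" could look like the sketch below. This is an illustrative sketch under those assumptions, not the patch or the linked gist: the per-node-class method lookup is cached in a ConcurrentHashMap so concurrent visits never corrupt the cache, and the original invocation exception is unwrapped rather than surfaced as a reflection wrapper:

```java
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a reflection-based single-argument dispatch with a thread-safe
// method cache. Class and method names are illustrative, not the patch's API.
public class ReflectiveVisitor {
    // keyed by the node's runtime class; safe under concurrent dispatch
    private final ConcurrentHashMap<Class<?>, Method> cache = new ConcurrentHashMap<>();

    public Object dispatch(Object node) {
        Method m = cache.computeIfAbsent(node.getClass(), this::findVisit);
        try {
            return m.invoke(this, node);
        } catch (ReflectiveOperationException e) {
            // rethrow the client code's original exception transparently
            Throwable cause = e.getCause();
            throw new RuntimeException(cause != null ? cause : e);
        }
    }

    private Method findVisit(Class<?> nodeClass) {
        // walk up the node's class hierarchy to the most specific visit() overload
        for (Class<?> c = nodeClass; c != null; c = c.getSuperclass()) {
            try {
                return getClass().getMethod("visit", c);
            } catch (NoSuchMethodException ignored) {
                // no overload for this exact type; try the superclass
            }
        }
        throw new IllegalArgumentException("no visit() method for " + nodeClass);
    }

    // Example visitor: handles one node type specially, everything else generically.
    public static class TermCounter extends ReflectiveVisitor {
        public String visit(String term) { return "term:" + term; }
        public Object visit(Object other) { return "other"; }
    }
}
```

A production version would also detect ambiguous overloads (e.g. two applicable interface matches) and report them with the kind of diagnostics the comment asks for; this sketch only walks superclasses.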
[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027361#comment-13027361 ] Earwin Burrfoot commented on LUCENE-3055: - Could anyone remind me, why the hell do we still have Analyzer.tokenStream AND reusableTokenStream rampaging around and confusing minds? We always recommend to use the latter, Robert just fixed some of the core classes to use the latter. Also, if reusableTokenStream is the only method left standing, isn't it wise to hide actual reuse somewhere in Lucene internals and turn Analyzer into a plain and dumb factory interface? > LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers > -- > > Key: LUCENE-3055 > URL: https://issues.apache.org/jira/browse/LUCENE-3055 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 3.1 >Reporter: Ian Soboroff > > LUCENE-2372 and LUCENE-2389 marked all analyzers as final. This makes > ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. > StandardAnalyzer to make a small modification e.g. to tokenStream(). These > issues don't indicate a new method of doing this. The issues don't give a > reason except for design considerations, which seems a poor reason to make a > backward-incompatible change
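Hiding the reuse in the internals, as the comment above suggests, could be as simple as a per-thread cache wrapped around a plain factory. A hypothetical sketch, not Lucene code; `StringBuilder` merely stands in for a reusable TokenStream:

```java
import java.util.function.Supplier;

// Hypothetical sketch of "hide the reuse in the internals": the framework keeps
// one cached instance per thread, so the user-facing type shrinks to a plain,
// dumb factory (the Supplier). Names are illustrative.
public class PerThreadReuse<T> {
    private final ThreadLocal<T> cached;

    public PerThreadReuse(Supplier<T> factory) {
        this.cached = ThreadLocal.withInitial(factory);
    }

    public T get() {
        return cached.get(); // same instance on every call from the same thread
    }
}
```

The user then only writes the factory; whether and how instances are reused becomes the framework's business.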
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027080#comment-13027080 ] Earwin Burrfoot commented on LUCENE-3041: - I vehemently oppose introducing the "visitor design pattern" (classic double-dispatch version) into the Query API. It is a badly broken replacement (ie, cannot be easily extended) for multiple dispatch. Also, from the looks of it (short IRC discussion), user-written visitors and the rewrite() API have totally different aims. - rewrite() is very specific (it is a pre-search preparation that produces a runnable query, eg expands multi-term queries into OR sequences or wrapped filters), but should work over any kinds of user-written Queries with possibly exotic behaviours (eg, take the rewrite from a cache). Consequently, the logic is tightly coupled to each Query-impl's innards. - user-written visitors, on the other hand, may have a multitude of purposes (wildly varying logic for node handling + navigation - eg, some may want to see MTQs expanded, and some may not) over a relatively fixed number of possible node types. So the best possible solution so far is to keep rewrite() as-is - it serves its purpose quite well. And introduce a generic reflection-based multiple-dispatch visitor that can walk any kind of hierarchies (eg, in my project I rewrite ASTs to ASTs, ASTs to Queries, and Queries to bags of Terms) so people can transform their query trees. The current patch contains a derivative of [my original version|https://gist.github.com/dfebaf79f5524e6ea8b4]. And here's a [test/example|https://gist.github.com/e5eb67d762be0bce8d28] This visitor keeps all logic on itself and thus cannot replace rewrite().
[jira] [Commented] (LUCENE-2571) Indexing performance tests with realtime branch
[ https://issues.apache.org/jira/browse/LUCENE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020217#comment-13020217 ] Earwin Burrfoot commented on LUCENE-2571: - bq. Merges are NOT blocking indexing on trunk no matter which MP you use. Well.. merges tie up IO (especially if you're not on fancy SSDs/RAIDs), which in turn slows flushes -> bigger delays for stop-the-world flushes / a lower bandwidth cap (after which they are forced to stop the world) for parallel flushes. So Lance's point is partially valid. > Indexing performance tests with realtime branch > --- > > Key: LUCENE-2571 > URL: https://issues.apache.org/jira/browse/LUCENE-2571 > Project: Lucene - Java > Issue Type: Task > Components: Index >Reporter: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: wikimedium.realtime.Standard.nd10M_dps.png, > wikimedium.realtime.Standard.nd10M_dps_addDocuments.png, > wikimedium.realtime.Standard.nd10M_dps_addDocuments_flush.png, > wikimedium.trunk.Standard.nd10M_dps.png, > wikimedium.trunk.Standard.nd10M_dps_addDocuments.png > > > We should run indexing performance tests with the DWPT changes and compare to > trunk. > We need to test both single-threaded and multi-threaded performance. > NOTE: flush by RAM isn't implemented just yet, so either we wait with the > tests or flush by doc count. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs
[ https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014108#comment-13014108 ] Earwin Burrfoot commented on LUCENE-2981: - Bye-bye, DB. Few things can compete with it in pointlessness. > Review and potentially remove unused/unsupported Contribs > - > > Key: LUCENE-2981 > URL: https://issues.apache.org/jira/browse/LUCENE-2981 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Grant Ingersoll > Fix For: 3.2, 4.0 > > Attachments: LUCENE-2981.patch > > > Some of our contribs appear to be lacking for development/support or are > missing tests. We should review whether they are even pertinent these days > and potentially deprecate and remove them. > One of the things we did in Mahout when bringing in Colt code was to mark all > code that didn't have tests as @deprecated and then we removed the > deprecation once tests were added. Those that didn't get tests added over > about a 6 mos. period of time were removed. > I would suggest taking a hard look at: > ant > db > lucli > swing > (spatial should be gutted to some extent and moved to modules) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007136#comment-13007136 ] Earwin Burrfoot commented on LUCENE-2960: - You avoid deprecation/undeprecation and binary incompatibility, while incompatibly changing semantics. What do you win? > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2960.patch > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007048#comment-13007048 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of "purity". So what? What exactly is overkill? Few wasted bytes and CPU ns for an object that's created a couple of times during application lifetime? There are also builders, which are very similar to what Steven is proposing. bq. Another thought is to offer all settings on the IWC for init convenience and exposure and then add javadoc about updaters on IW for those settings that can be changed on the fly That's exactly how I'd like to see it. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2960.patch > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006759#comment-13006759 ] Earwin Burrfoot commented on LUCENE-2960: - bq. infoStream is a PrintStream, which synchronizes anyway, so it should be safe to omit the volatile You're absolutely right here. bq. Yet, no "real" Java impl out there will ever do this since doing so will simply make that Java impl appear buggy. Sorry, but "real" Java impls do this. The case with endless get() happened on a map that was never modified after being created and set. Just one of the many JVM instances on many machines got unlucky after restart. bq. Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache levels, should cache only frequently used things Java impls don't cache things. They do reorderings, they also keep final fields on registers, omitting reloads that happen for non-final ones, but no caching in JMM-related cases. Caching here is done by CPU, and it caches all data read from memory. bq. IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final. Setters can return modified immutable copy of 'this'. So you get both incremental building and immutability. bq. How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW. What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 'live' properties on IW. This clearly draws the line, and you don't have to consult Javadocs for each and every setting to know if you can change it live or not. 
> Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2960.patch > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
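The "setters can return a modified immutable copy of 'this'" idea from the comment above can be sketched like this. The Config class and its fields are hypothetical, not Lucene's IndexWriterConfig:

```java
// Hypothetical immutable config: every field is final, and each "setter"
// returns a modified copy instead of mutating this. You keep the fluent,
// incremental style new Config().withRamBufferMB(64).withMaxThreads(4),
// but a published instance can never change under a reader's feet, and
// final fields get safe-publication guarantees from the JMM.
public final class Config {
    private final double ramBufferMB;
    private final int maxThreads;

    public Config() { this(16.0, 1); }
    private Config(double ramBufferMB, int maxThreads) {
        this.ramBufferMB = ramBufferMB;
        this.maxThreads = maxThreads;
    }

    // Each with-method allocates a fresh Config; the original is untouched.
    public Config withRamBufferMB(double mb) { return new Config(mb, maxThreads); }
    public Config withMaxThreads(int n) { return new Config(ramBufferMB, n); }

    public double ramBufferMB() { return ramBufferMB; }
    public int maxThreads() { return maxThreads; }
}
```

The cost is exactly the "few wasted bytes and CPU ns" mentioned above: one small allocation per setter call on an object created a handful of times per application lifetime.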
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006227#comment-13006227 ] Earwin Burrfoot commented on LUCENE-2960: - {quote} Why such purity? What do we gain? I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom... {quote} We gain consistency and predictability. And there are a lot of freedoms that are dangerous for developers. {quote} In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once. {quote} You've brought up a purrfect example of how NOT to do things. This is called 'action at a distance' and is a damn bug. A very annoying one. I've thoroughly experienced it with a previous major version of Apache HTTPClient - they had an API that suggested you could set per-request timeouts, while these were actually global for a single Client instance. I fried my brain trying to understand why the hell random user requests timed out at a hundred times their intended duration. Oh! It was an occasional admin request changing the global. You know, you can actually instantiate some DateRangeFilter with a couple of Dates, and then change these Dates (they are writeable) before each request. Isn't it an exciting kind of programming freedom? Or, back to our current discussion - we can pass RAMBufferSizeMB as an AtomicDouble, instead of the current double, then we can use .set() on an instance we passed, and have our live reconfigurability. What's more - AtomicDouble protects us from word tearing! bq. I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems. Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get(). It happened on your run-of-the-mill Sun 1.6 JVM. So the bug is there, lying in wait. 
Maybe nobody ever actually used the freedom to change infoStream in-flight, or the guy was lucky, or in his particular situation the field was guarded by some unrelated sync. While I see banishing live reconfiguration from IW as a lost cause, I ask to make IWC immutable at the very least. As Shay said - this will provide a clear barrier between mutable and immutable properties. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
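The unsafe-publication hazard discussed above can be illustrated with a small sketch. The Holder class and its field names are hypothetical (this is not Lucene code); the point is the volatile write/read pair:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder illustrating the hazard. If 'settings' were a plain
// (non-volatile) field, a reader thread could observe the reference before
// the HashMap's internal tables were fully written - on a real JVM this can
// show up as missing entries or even an endless loop inside get(). The
// volatile write/read pair establishes a happens-before edge, so a reader
// can only ever see a fully built map.
public class Holder {
    private volatile Map<String, String> settings = new HashMap<>();

    public void reconfigure(Map<String, String> newSettings) {
        // Build a fresh map, then publish it in a single volatile write;
        // never mutate a map that has already been published.
        settings = new HashMap<>(newSettings);
    }

    public String get(String key) {
        return settings.get(key);
    }
}
```

The race itself cannot be demonstrated deterministically in a unit test, which is precisely why such bugs "lie in wait" - the code above shows only the safe pattern.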
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005891#comment-13005891 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Furthermore, closing the IW also forces you to commit, and I don't like tying changing of configuration to forcing a commit. Like I said, one isn't going to change his configuration five times a second. It's ok to commit from time to time? bq. So why should we force it to be unchangeable? That can only remove freedom, freedom that is perhaps valuable to an app somewhere. Each and every live reconfigurable setting adds to complexity. At the very least it requires proper synchronization. Take your SegmentWarmer example - you should make the field volatile. While it's possible to chicken out on primitive fields ([except long/double|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]), as Yonik mentioned earlier, making nonvolatile mutable references introduces you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is currently broken!). For more complex cases, certain on-change logic is required. And then you have to support this logic across all possible code rewrites and refactorings. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005617#comment-13005617 ] Earwin Burrfoot commented on LUCENE-2960: - As I said on the list - if one needs to change IW config, he can always recreate IW with new settings. Such changes cannot happen often enough for recreation to affect indexing performance. The fact that you can change IW's behaviour post-construction by modifying unrelated IWC instance is frightening. IW should either make a private copy of IWC when constructing, or IWC should be made immutable. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994769#comment-12994769 ] Earwin Burrfoot commented on LUCENE-2908: - Oh, damn :) On my project, we specifically use java-serialization to pass configured Queries/Filters between cluster nodes, as it saves us HEAPS of wrapping/unwrapping them into some parallel serializable classes. > clean up serialization in the codebase > -- > > Key: LUCENE-2908 > URL: https://issues.apache.org/jira/browse/LUCENE-2908 > Project: Lucene - Java > Issue Type: Task >Reporter: Robert Muir >Assignee: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-2908.patch > > > We removed contrib/remote, but forgot to cleanup serialization hell > everywhere. > this is no longer needed, never really worked (e.g. across versions), and > slows > development (e.g. i wasted a long time debugging stupid serialization of > Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
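The pattern described above - shipping a configured, Serializable object graph between cluster nodes instead of mapping it onto parallel serializable DTO classes - looks roughly like this. QuerySpec is a hypothetical stand-in for a real Query/Filter tree; note Robert's caveat in the issue that Java serialization never really worked across Lucene versions, so this only holds when all nodes run the same version:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Wire {
    // Hypothetical query description; a real use would serialize the
    // configured query objects themselves.
    static class QuerySpec implements Serializable {
        private static final long serialVersionUID = 1L;
        final String field, text;
        QuerySpec(String field, String text) { this.field = field; this.text = text; }
    }

    // Serialize any object graph to bytes for the network.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Reconstruct the graph on the receiving node.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```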
[jira] Commented: (LUCENE-2871) Use FileChannel in FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984222#action_12984222 ] Earwin Burrfoot commented on LUCENE-2871: - Before arguing where to put this new IndexOutput, I think it's wise to have a benchmark proving we need it at all. I have serious doubts FileChannel is going to outperform RAF.write(). Why should it? And for the purposes of a benchmark, it can live anywhere. > Use FileChannel in FSDirectory > -- > > Key: LUCENE-2871 > URL: https://issues.apache.org/jira/browse/LUCENE-2871 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Reporter: Shay Banon > Attachments: LUCENE-2871.patch, LUCENE-2871.patch > > > Explore using FileChannel in FSDirectory to see if it improves write > operations performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
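A throwaway micro-benchmark of the kind the comment above asks for could look like this. All names here are hypothetical, and numbers from such a toy loop are only indicative (no JIT warmup discipline, OS page-cache effects, etc.), but it is enough to sanity-check whether FileChannel.write() beats RandomAccessFile.write() for sequential output:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class WriteBench {
    static final int BUF = 8 * 1024;   // write in 8 KB chunks
    static final int ITERS = 2000;     // ~16 MB per run

    // Sequential writes through RandomAccessFile.write(byte[]).
    static long timeRaf(File f, byte[] buf) throws IOException {
        long t0 = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            for (int i = 0; i < ITERS; i++) raf.write(buf);
        }
        return System.nanoTime() - t0;
    }

    // The same writes through the file's FileChannel.
    static long timeChannel(File f, byte[] buf) throws IOException {
        long t0 = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            ByteBuffer bb = ByteBuffer.wrap(buf);
            for (int i = 0; i < ITERS; i++) { bb.rewind(); ch.write(bb); }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[BUF];
        File a = File.createTempFile("raf", ".bin");
        File b = File.createTempFile("chan", ".bin");
        a.deleteOnExit(); b.deleteOnExit();
        System.out.printf("RAF: %d ms, FileChannel: %d ms%n",
                timeRaf(a, buf) / 1_000_000, timeChannel(b, buf) / 1_000_000);
    }
}
```

As the comment notes, the benchmark can live anywhere; no decision about where the new IndexOutput belongs is needed to run it.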
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983162#action_12983162 ] Earwin Burrfoot commented on LUCENE-2657: - Thanks, but I'm not the one confused here. : ) > Replace Maven POM templates with full POMs, and change documentation > accordingly > > > Key: LUCENE-2657 > URL: https://issues.apache.org/jira/browse/LUCENE-2657 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch > > > The current Maven POM templates only contain dependency information, the bare > bones necessary for uploading artifacts to the Maven repository. > The full Maven POMs in the attached patch include the information necessary > to run a multi-module Maven build, in addition to serving the same purpose as > the current POM templates. > Several dependencies are not available through public maven repositories. A > profile in the top-level POM can be activated to install these dependencies > from the various {{lib/}} directories into your local repository. 
From the > top-level directory: > {code} > mvn -N -Pbootstrap install > {code} > Once these non-Maven dependencies have been installed, to run all Lucene/Solr > tests via Maven's surefire plugin, and populate your local repository with > all artifacts, from the top level directory, run: > {code} > mvn install > {code} > When one Lucene/Solr module depends on another, the dependency is declared on > the *artifact(s)* produced by the other module and deposited in your local > repository, rather than on the other module's un-jarred compiler output in > the {{build/}} directory, so you must run {{mvn install}} on the other module > before its changes are visible to the module that depends on it. > To create all the artifacts without running tests: > {code} > mvn -DskipTests install > {code} > I almost always include the {{clean}} phase when I do a build, e.g.: > {code} > mvn -DskipTests clean install > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983160#action_12983160 ] Earwin Burrfoot commented on LUCENE-2657: - bq. we need to be very clear and it has no effect on artifacts I feel something was missed in the heat of debate. Eg: bq. The latest patch on this release uses the Ant artifacts directly. bq. This patch uses the Ant-produced artifacts to prepare for Maven artifact publishing. bq. Maven itself is not invoked in the process. An Ant plugin handles the artifact deployment. I will now try to decipher these quotes. It seems the patch takes the artifacts produced by Ant, as a part of our usual (and only) build process, and shoves them down the Maven repository's throat along with a bunch of pom-descriptors. Nothing else is happening. Also, after everything that has been said, I think nobody in his right mind will *force* anyone to actually use the Ant target in question as a part of a release. But it's nice to have it around, in case some user-friendly committer would like to push (I'd like to reiterate - Ant-generated) artifacts into Maven. > Replace Maven POM templates with full POMs, and change documentation > accordingly > > > Key: LUCENE-2657 > URL: https://issues.apache.org/jira/browse/LUCENE-2657 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch > > > The current Maven POM templates only contain dependency information, the bare > bones necessary for uploading artifacts to the Maven repository. 
> The full Maven POMs in the attached patch include the information necessary > to run a multi-module Maven build, in addition to serving the same purpose as > the current POM templates. > Several dependencies are not available through public maven repositories. A > profile in the top-level POM can be activated to install these dependencies > from the various {{lib/}} directories into your local repository. From the > top-level directory: > {code} > mvn -N -Pbootstrap install > {code} > Once these non-Maven dependencies have been installed, to run all Lucene/Solr > tests via Maven's surefire plugin, and populate your local repository with > all artifacts, from the top level directory, run: > {code} > mvn install > {code} > When one Lucene/Solr module depends on another, the dependency is declared on > the *artifact(s)* produced by the other module and deposited in your local > repository, rather than on the other module's un-jarred compiler output in > the {{build/}} directory, so you must run {{mvn install}} on the other module > before its changes are visible to the module that depends on it. > To create all the artifacts without running tests: > {code} > mvn -DskipTests install > {code} > I almost always include the {{clean}} phase when I do a build, e.g.: > {code} > mvn -DskipTests clean install > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983152#action_12983152 ] Earwin Burrfoot commented on LUCENE-2657: - I am *amazed* at how idea/eclipse ant tasks managed to get into trunk. No, really, emacs is just enough for Lucene development, as proven by one of our most seasoned developers. So why the hell are those extras polluting main tree? > Replace Maven POM templates with full POMs, and change documentation > accordingly > > > Key: LUCENE-2657 > URL: https://issues.apache.org/jira/browse/LUCENE-2657 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch > > > The current Maven POM templates only contain dependency information, the bare > bones necessary for uploading artifacts to the Maven repository. > The full Maven POMs in the attached patch include the information necessary > to run a multi-module Maven build, in addition to serving the same purpose as > the current POM templates. > Several dependencies are not available through public maven repositories. A > profile in the top-level POM can be activated to install these dependencies > from the various {{lib/}} directories into your local repository. 
From the > top-level directory: > {code} > mvn -N -Pbootstrap install > {code} > Once these non-Maven dependencies have been installed, to run all Lucene/Solr > tests via Maven's surefire plugin, and populate your local repository with > all artifacts, from the top level directory, run: > {code} > mvn install > {code} > When one Lucene/Solr module depends on another, the dependency is declared on > the *artifact(s)* produced by the other module and deposited in your local > repository, rather than on the other module's un-jarred compiler output in > the {{build/}} directory, so you must run {{mvn install}} on the other module > before its changes are visible to the module that depends on it. > To create all the artifacts without running tests: > {code} > mvn -DskipTests install > {code} > I almost always include the {{clean}} phase when I do a build, e.g.: > {code} > mvn -DskipTests clean install > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2755) Some improvements to CMS
[ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982564#action_12982564 ] Earwin Burrfoot commented on LUCENE-2755: - bq. if you still want to work on it, the I can keep the issue open and mark it 3.2 (unless you want to give it a try in 3.1). I'll start another later, so please, go on. > Some improvements to CMS > > > Key: LUCENE-2755 > URL: https://issues.apache.org/jira/browse/LUCENE-2755 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2755.patch > > > While running optimize on a large index, I've noticed several things that got > me to read CMS code more carefully, and find these issues: > * CMS may hold onto a merge if maxMergeCount is hit. That results in the > MergeThreads taking merges from the IndexWriter until they are exhausted, and > only then that blocked merge will run. I think it's unnecessary that that > merge will be blocked. > * CMS sorts merges by segments size, doc-based and not bytes-based. Since the > default MP is LogByteSizeMP, and I hardly believe people care about doc-based > size segments anymore, I think we should switch the default impl. There are > two ways to make it extensible, if we want: > ** Have an overridable member/method in CMS that you can extend and override > - easy. > ** Have OneMerge be comparable and let the MP determine the order (e.g. by > bytes, docs, calibrate deletes etc.). Better, but will need to tap into > several places in the code, so more risky and complicated. > On the go, I'd like to add some documentation to CMS - it's not very easy to > read and follow. > I'll work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982440#action_12982440 ] Earwin Burrfoot commented on LUCENE-2374: - Another step in the same direction then. Instead of bq. The only problem with this approach would be that the attribute keys must be unique - an idea would be to prefix them with the attribute name. Let us define the interface as - void reflect(Class attributeClass, String key, Object value) ? If the client code then wants to call toString() on attributeClass and concat with key - it's free to do so. If it wants to be more creative - it can. > Add introspection API to AttributeSource/AttributeImpl > -- > > Key: LUCENE-2374 > URL: https://issues.apache.org/jira/browse/LUCENE-2374 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1, 4.0 > > > AttributeSource/TokenStream inspection in Solr needs to have some insight > into the contents of AttributeImpls. As LUCENE-2302 has some problems with > toString() [which is not structured and conflicts with CharSequence's > definition for CharTermAttribute], I propose a simple API that gets a default > implementation in AttributeImpl (just like toString() currently): > - Iterator> AttributeImpl.contentsIterator() returns an > iterator (for most attributes it's a singleton) of a key-value pair, e.g. > "term"->"foobar","startOffset"->Integer.valueOf(0),... > - AttributeSource gets the same method, it just concats the iterators of each > getAttributeImplsIterator() AttributeImpl > No backwards problems occur, as the default toString() method will work like > before (it just gets the iterator and lists), but we simply remove the > documentation for the format. (Char)TermAttribute gets a special impl of > toString() according to CharSequence and a corresponding iterator. 
> I also want to remove the abstract hashCode() and equals() methods from > AttributeImpl, as they are not needed and just create work for the > implementor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
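The reflect(Class, String, Object) variant proposed in the comment above can be sketched in a few lines. Everything below is an illustrative stand-in, not the actual Lucene API: AttributeReflector, ToyTermAttribute, and ReflectDemo are hypothetical names invented for this sketch.

```java
// Hypothetical sketch of the callback-style introspection API discussed above.
interface AttributeReflector {
    void reflect(Class<?> attributeClass, String key, Object value);
}

class ToyTermAttribute {
    private final String term = "foobar";

    // The attribute pushes its contents into the reflector instead of
    // exposing a fake Map or a structured toString().
    void reflectWith(AttributeReflector reflector) {
        reflector.reflect(CharSequence.class, "term", term);
    }
}

class ReflectDemo {
    // Client code is free to build unique keys by prefixing the class name,
    // or to do something more creative - the choice is no longer baked in.
    static String describe(ToyTermAttribute att) {
        StringBuilder sb = new StringBuilder();
        att.reflectWith((cls, key, value) ->
            sb.append(cls.getSimpleName()).append('#').append(key).append('=').append(value));
        return sb.toString();
    }
}
```

The point of passing the Class separately is exactly what the comment argues: key uniqueness becomes the caller's concern, handled however the caller likes.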
[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982437#action_12982437 ] Earwin Burrfoot commented on LUCENE-2374: - Nice. Except maybe introduce a simple interface instead of the Map? {code} interface AttributeReflector { // Name is crap, should be changed void reflect(String key, Object value); } void reflectWith(AttributeReflector reflector); {code} You have no need for fake maps then, both in toString() and in user code.
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982166#action_12982166 ] Earwin Burrfoot commented on LUCENE-2858: - APIs have to be there still. All that commity, segment-deletery, mutabley stuff (that spans both atomic and composite readers). So, while your plan is viable, it won't remove that much cruft. > Separate SegmentReaders (and other atomic readers) from composite IndexReaders > -- > > Key: LUCENE-2858 > URL: https://issues.apache.org/jira/browse/LUCENE-2858 > Project: Lucene - Java > Issue Type: Task >Reporter: Uwe Schindler > Fix For: 4.0 > > > With current trunk, whenever you open an IndexReader on a directory you get > back a DirectoryReader, which is a composite reader. The interface of > IndexReader now has lots of methods that simply throw UOE (in fact more than > 50% of the commonly used methods are unusable now). This > confuses users and makes the API hard to understand. > This issue should split "atomic readers" from "reader collections" with a > separate API. After that, you are no longer able to get a TermsEnum without > wrapping from those composite readers. We currently have helper classes for > wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or > Multi*), those should be retrofitted to implement the correct classes > (SlowMultiReaderWrapper would be an atomic reader but takes a composite > reader as ctor param, maybe it could also simply take a List). > In my opinion, maybe composite readers could implement some collection APIs > and also have the ReaderUtil method directly built in (possibly as a "view" > in the util.Collection sense). In general composite readers do not really > need to look like the previous IndexReaders, they could simply be a > "collection" of SegmentReaders with some functionality like reopen. > On the other side, atomic readers do not need reopen logic anymore? When a > segment changes, you need a new atomic reader? - maybe because of deletions > that's not the best idea, but we should investigate. Maybe make the whole > reopen logic simpler to use (at least on the collection reader level). > We should decide about good names, I have no preference at the moment.
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982132#action_12982132 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Still, i think we would need this method (somewhere) even with CSF, so that people can change the norms and they instantly take effect for searches. This still puzzles me. I can strain my imagination, and get people who just need to change norms without reindexing. But doing this and *requiring* instant turnaround? Kid me not :)
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982126#action_12982126 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Any comments about removing write access from IndexReaders? I think setNorms() will be removed soon, but how about the others like deleteDocument()? I would propose to also make all IndexReaders simply readers, not writers? Voting with all my extremities - yes!!
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981774#action_12981774 ] Earwin Burrfoot commented on LUCENE-2868: - We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than the default visitor pattern in Java. Usage looks roughly like: {code} class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of walk result String visit(TermQuery q) { return "{term: " + q.getTerm() + "}"; } String visit(BooleanQuery q) { StringBuffer buf = new StringBuffer(); buf.append("{boolean: "); for (BooleanQuery.Clause clause: q.clauses()) { buf.append(dispatch(clause.getQuery())).append(", "); } buf.append("}"); return buf.toString(); } String visit(SpanQuery q) { // Runs for all SpanQueries ... } String visit(Query q) { // Runs for all Queries not covered by a more exact visit() method ... } } Query query = ...; String stringRepresentation = new ToStringWalker().dispatch(query); {code} dispatch() checks its parameter's runtime type, picks the closest visit() overload (following Java's rules for compile-time overloaded-method resolution), and invokes it. > It should be easy to make use of TermState; rewritten queries should be > shared automatically > > > Key: LUCENE-2868 > URL: https://issues.apache.org/jira/browse/LUCENE-2868 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Reporter: Karl Wright > Attachments: query-rewriter.patch > > > When you have the same query in a query hierarchy multiple times, tremendous > savings can now be had if the user knows enough to share the rewritten > queries in the hierarchy, due to the TermState addition. 
But this is clumsy > and requires a lot of coding by the user to take advantage of. Lucene should > be smart enough to share the rewritten queries automatically. > This can be most readily (and powerfully) done by introducing a new method to > Query.java: > Query rewriteUsingCache(IndexReader indexReader) > ... and including a caching implementation right in Query.java which would > then work for all. Of course, all callers would want to use this new method > rather than the current rewrite().
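The dispatch mechanism the comment describes can be sketched with plain reflection. This is a simplified, hypothetical reconstruction (it only climbs superclasses, ignores interfaces, and does no Method caching); Query and TermQuery below are toy stand-ins for the real classes:

```java
import java.lang.reflect.Method;

// Toy query hierarchy standing in for the real AST node types.
class Query {}
class TermQuery extends Query {}
class PhraseQuery extends Query {}

class DispatchingVisitor {
    // Walk up the argument's runtime type hierarchy until a matching
    // public visit(...) overload exists, then invoke it.
    String dispatch(Query q) {
        for (Class<?> c = q.getClass(); Query.class.isAssignableFrom(c); c = c.getSuperclass()) {
            try {
                Method m = getClass().getMethod("visit", c);
                return (String) m.invoke(this, q);
            } catch (NoSuchMethodException e) {
                // No overload for this exact type; try the superclass next.
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        throw new IllegalArgumentException("no visit() overload for " + q.getClass());
    }
}

class ToStringWalker extends DispatchingVisitor {
    public String visit(TermQuery q) { return "{term}"; }
    public String visit(Query q)     { return "{query}"; } // fallback for everything else
}
```

A PhraseQuery has no dedicated overload here, so dispatch() falls through to visit(Query) - the same "closest overload" behavior the comment describes.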
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981388#action_12981388 ] Earwin Burrfoot commented on LUCENE-2324: - Maan, this comment list is infinite. How do I currently get the ..er.. current version? Latest branch + latest Jason's patch? Regardless of everything else, I'd ask you not to extend random things :) at least if you can't say is-a about them. DocumentsWriterPerThreadPool.ThreadState IS A ReentrantLock? No. So you're better off encapsulating it rather than extending. Same can be applied to SegmentInfos that extends Vector :/ > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU.
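The composition-over-inheritance point above ("ThreadState IS A ReentrantLock? No.") can be shown in miniature. The fields and methods below are illustrative only, not the actual API from the patch:

```java
import java.util.concurrent.locks.ReentrantLock;

// A ThreadState that *has* a lock rather than *is* a lock: only the
// lock operations it actually needs are exposed, so callers cannot
// misuse it as a general-purpose ReentrantLock.
class ThreadState {
    private final ReentrantLock lock = new ReentrantLock();
    // ... per-thread indexing state would live here ...

    boolean tryLock() { return lock.tryLock(); }
    void unlock()     { lock.unlock(); }
    boolean isHeldByCurrentThread() { return lock.isHeldByCurrentThread(); }
}
```

The same refactoring applies to SegmentInfos extending Vector: hold the list as a private field and expose only the operations that make sense for a segment list.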
[jira] Commented: (LUCENE-2863) Updating a document loses its fields that are only indexed, also NumericField tries are completely lost
[ https://issues.apache.org/jira/browse/LUCENE-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980965#action_12980965 ] Earwin Burrfoot commented on LUCENE-2863: - updateDocument() is an atomic version of deleteDocument() + addDocument(), nothing more, and there's nothing surprising about losing your fields if you delete the doc and don't add them back later. > Updating a document loses its fields that are only indexed, also NumericField > tries are completely lost > --- > > Key: LUCENE-2863 > URL: https://issues.apache.org/jira/browse/LUCENE-2863 > Project: Lucene - Java > Issue Type: Bug > Components: Store >Affects Versions: 3.0.2, 3.0.3 > Environment: WindowsXP, Java1.6.20 using a RamDirectory >Reporter: Tamas Sandor >Priority: Blocker > > I have a code snippet (see below) which creates a new document with standard > (stored, indexed), *not-stored, indexed-only* and some *NumericFields*. Then > it updates the document via adding a new string field. The result is that all > the fields that are not stored but indexed-only, and especially the > NumericField trie tokens, are completely lost from the index after update or delete/add. 
> {code:java} > Directory ramDir = new RAMDirectory(); > IndexWriter writer = new IndexWriter(ramDir, new WhitespaceAnalyzer(), > MaxFieldLength.UNLIMITED); > Document doc = new Document(); > doc.add(new Field("ID", "HO1234", Store.YES, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new Field("PATTERN", "HELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new NumericField("LAT", Store.YES, > true).setDoubleValue(51.48826603066d)); > doc.add(new NumericField("LNG", Store.YES, > true).setDoubleValue(-0.08913399651646614d)); > writer.addDocument(doc); > doc = new Document(); > doc.add(new Field("ID", "HO", Store.YES, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new Field("PATTERN", "BELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new NumericField("LAT", Store.YES, > true).setDoubleValue(101.48826603066d)); > doc.add(new NumericField("LNG", Store.YES, > true).setDoubleValue(-100.08913399651646614d)); > writer.addDocument(doc); > Term t = new Term("ID", "HO1234"); > Query q = new TermQuery(t); > IndexSearcher searcher = new IndexSearcher(writer.getReader()); > TopDocs hits = searcher.search(q, 1); > if (hits.scoreDocs.length > 0) { > Document ndoc = searcher.doc(hits.scoreDocs[0].doc); > ndoc.add(new Field("FINAL", "FINAL", Store.YES, > Index.NOT_ANALYZED_NO_NORMS)); > writer.updateDocument(t, ndoc); > // writer.deleteDocuments(q); > // writer.addDocument(ndoc); > } else { > LOG.info("Couldn't find the document via the query"); > } > searcher = new IndexSearcher(writer.getReader()); > hits = searcher.search(new TermQuery(new Term("PATTERN", "HELLO")), 1); > LOG.info("_hits HELLO:" + hits.totalHits); // should be 1 but it's 0 > writer.close(); > {code} > And I have a boundingbox query based on *NumericRangeQuery*. After the > document update it doesn't return any hit.
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980736#action_12980736 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it... but I can't see myself thinking that we should put this logic into fsdir itself, sorry. {quote} I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core. What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere. > Directory createOutput and openInput should take an IOContext > - > > Key: LUCENE-2793 > URL: https://issues.apache.org/jira/browse/LUCENE-2793 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless > Attachments: LUCENE-2793.patch > > > Today for merging we pass down a larger readBufferSize than for searching > because we get better performance. > I think we should generalize this to a class (IOContext), which would hold > the buffer size, but then could hold other flags like DIRECT (bypass OS's > buffer cache), SEQUENTIAL, etc. > Then, we can make the DirectIOLinuxDirectory fully usable because we would > only use DIRECT/SEQUENTIAL during merging. > This will require fixing how IW pools readers, so that a reader opened for > merging is not then used for searching, and vice/versa. Really, it's only > all the open file handles that need to be different -- we could in theory > share del docs, norms, etc, if that were somehow possible.
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980732#action_12980732 ] Earwin Burrfoot commented on LUCENE-2793: - bq. Because in your example code above, it looks like it's added to Directory itself. bq. My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself. Ohoho. My fault, sorry. It should look like: {code} RAMDirectory ramDir = new RAMDirectory(); ramDir.setBufferSize(whatever) // Compilation error! ramDir.createIndexInput(name, context); NIOFSDirectory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) {code}
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980649#action_12980649 ] Earwin Burrfoot commented on LUCENE-2793: - What's with the ongoing craziness? :) bq. DirectIOLinuxDirectory First you introduce a kind of directory that is utterly useless except in certain special situations. Then, instead of fixing the directory/folding its code somewhere normal, you try to work around it by switching between directories. What's the point of using abstract classes or interfaces, if you leak their implementation's logic all over the place? Or making DIOLD wrap something. Yeah! Wrap my RAMDir! bq. bufferSize This value is only meaningful to a certain subset of Directory implementations. So the only logical place we want to see this value set - is these very impls. Sample code: {code} Directory ramDir = new RAMDirectory(); ramDir.createIndexInput(name, context); // See, ma? No bufferSizes, they are pointless for RAMDir Directory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) // See, ma? The only one who's really concerned with 'actual' buffer size is this concrete Directory impl // All client code is only concerned with the context. // It's NIOFSDirectory's business to give meaningful interpretation to IOContext and assign the buffer sizes. {code} You don't need custom Directory impls to make DIOLD work, you should freakin' fix it. The proper way is to test out the things, and then move the DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not. You don't need custom Directory impls to set buffer sizes (nor to cast to BufferedIndexInput!), you should add the setting to those Directories which make sense of it. 
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980458#action_12980458 ] Earwin Burrfoot commented on LUCENE-2793: - In fact, I suggest dropping bufferSize altogether. As far as I can recall, it was introduced as a precursor to IOContext and can now be safely replaced. Even if we want to give the user control over buffer size for all streams, or only those opened in a specific IOContext, he can pass these numbers as config parameters to his Directory impl. That makes total sense, as: 1. IndexWriter/IndexReader couldn't care less about buffer sizes, they just pass them to the Directory. It's not their concern. 2. A bunch of Directories don't use said bufferSize at all, making this parameter not only private Directory affairs, but even further - implementation-specific. So my bet is - introduce IOContext as a simple Enum, change the bufferSize parameter on createInput/Output to IOContext, done.
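The "IOContext as a simple enum" proposal could look like the sketch below. Everything here is hypothetical - the Directory base class, BufferedFSDirectory, and bufferSizeFor() are stand-ins loosely mirroring the setBufferSize(IOContext, int) sample given earlier in the thread, not real Lucene code:

```java
import java.util.EnumMap;
import java.util.Map;

// IOContext as a plain enum: callers only say *why* they open a stream.
enum IOContext { NORMAL_READ, MERGE }

abstract class Directory {
    // In the real proposal, createInput/createOutput would take an IOContext
    // instead of a bufferSize; here we only model the size lookup.
    abstract int bufferSizeFor(IOContext context);
}

class BufferedFSDirectory extends Directory {
    // Buffer sizes are an implementation detail of the one Directory
    // flavor that actually buffers; RAM-based impls never see them.
    private final Map<IOContext, Integer> bufferSizes = new EnumMap<>(IOContext.class);

    BufferedFSDirectory() {
        bufferSizes.put(IOContext.NORMAL_READ, 1024);
        bufferSizes.put(IOContext.MERGE, 4096);
    }

    void setBufferSize(IOContext context, int size) {
        bufferSizes.put(context, size);
    }

    @Override
    int bufferSizeFor(IOContext context) {
        return bufferSizes.get(context);
    }
}
```

Client code (IndexWriter/IndexReader) only ever passes the enum; mapping it to an actual buffer size stays inside the concrete Directory, which is the separation both comments argue for.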
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980454#action_12980454 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} bq. You get IOFactory from Directory That's for the default, the main use is the static IOFactory class. {quote} You lost me here. If you got A from B, you don't have to pass B again to invoke A; if you do - that's 99% a design mistake. But still, my point was that you don't need IOFactory at all. bq. Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations. What Directories are you trying to intermix? What for? I thought the only thing done in this issue is an attempt to give Directory hints as to why we're going to open its streams. A simple enum IOContext and an extra parameter on createOutput/Input would suffice. But with Lucene's micromanagement attitude, an enum turns into a slightly more complex thing, with bufferSizes and whatnot. Still - no need for mixing Directories.
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980448#action_12980448 ] Earwin Burrfoot commented on LUCENE-2856: - A SegmentListener that has a number of children SLs and delegates eventHappened() calls to them. > Create IndexWriter event listener, specifically for merges > -- > > Key: LUCENE-2856 > URL: https://issues.apache.org/jira/browse/LUCENE-2856 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 4.0 >Reporter: Jason Rutherglen > Attachments: LUCENE-2856.patch > > > The issue will allow users to monitor merges occurring within IndexWriter > using a callback notifier event listener. This can be used by external > applications such as Solr to monitor large segment merges.
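The composite listener suggested in the comment above might look like this sketch. The SegmentListener interface and its single eventHappened(String) method are hypothetical, not the API from the attached patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface for IndexWriter/merge events.
interface SegmentListener {
    void eventHappened(String event);
}

// One SegmentListener that fans each event out to its children,
// so IndexWriter only ever has to know about a single listener.
class CompositeSegmentListener implements SegmentListener {
    private final List<SegmentListener> children = new ArrayList<>();

    CompositeSegmentListener add(SegmentListener child) {
        children.add(child);
        return this; // allow chaining
    }

    @Override
    public void eventHappened(String event) {
        for (SegmentListener child : children) {
            child.eventHappened(event);
        }
    }
}
```

An external application such as Solr would register its own child listener on the composite instead of replacing whatever listener is already installed.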
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980400#action_12980400 ] Earwin Burrfoot commented on LUCENE-2793: - Looks crazy. In a -bad- tangled way. You get IOFactory from Directory, put into IOContext, and then invoke it, passing it (wow!) an IOContext and a Directory. What if you pass totally different Directory? Different IOContext? It blows up eerily. And there's no justification for this - we already have an IOFactory, it's called Directory! It just needs an extra parameter on its factory methods (createInput/Output), that's all. > Directory createOutput and openInput should take an IOContext > - > > Key: LUCENE-2793 > URL: https://issues.apache.org/jira/browse/LUCENE-2793 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless > Attachments: LUCENE-2793.patch > > > Today for merging we pass down a larger readBufferSize than for searching > because we get better performance. > I think we should generalize this to a class (IOContext), which would hold > the buffer size, but then could hold other flags like DIRECT (bypass OS's > buffer cache), SEQUENTIAL, etc. > Then, we can make the DirectIOLinuxDirectory fully usable because we would > only use DIRECT/SEQUENTIAL during merging. > This will require fixing how IW pools readers, so that a reader opened for > merging is not then used for searching, and vice/versa. Really, it's only > all the open file handles that need to be different -- we could in theory > share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980390#action_12980390 ] Earwin Burrfoot commented on LUCENE-2856: - A CompositeSegmentListener niftily removes the need for collection. > Create IndexWriter event listener, specifically for merges > -- > > Key: LUCENE-2856 > URL: https://issues.apache.org/jira/browse/LUCENE-2856 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 4.0 >Reporter: Jason Rutherglen > Attachments: LUCENE-2856.patch > > > The issue will allow users to monitor merges occurring within IndexWriter > using a callback notifier event listener. This can be used by external > applications such as Solr to monitor large segment merges. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980388#action_12980388 ] Earwin Burrfoot commented on LUCENE-2858: - bq. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? There is a freakload of places that "upgrade" SegmentReader in various ways, with deletions responsible for only part of the cases. I'll try getting back to LUCENE-2355 at the end of the week. > Separate SegmentReaders (and other atomic readers) from composite IndexReaders > -- > > Key: LUCENE-2858 > URL: https://issues.apache.org/jira/browse/LUCENE-2858 > Project: Lucene - Java > Issue Type: Task >Reporter: Uwe Schindler > Fix For: 4.0 > > > With current trunk, whenever you open an IndexReader on a directory you get > back a DirectoryReader which is a composite reader. The interface of > IndexReader has now lots of methods that simply throw UOE (in fact more than > 50% of all methods that are commonly used ones are unuseable now). This > confuses users and makes the API hard to understand. > This issue should split "atomic readers" from "reader collections" with a > separate API. After that, you are no longer able, to get TermsEnum without > wrapping from those composite readers. We currently have helper classes for > wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or > Multi*), those should be retrofitted to implement the correct classes > (SlowMultiReaderWrapper would be an atomic reader but takes a composite > reader as ctor param, maybe it could also simply take a List). > In my opinion, maybe composite readers could implement some collection APIs > and also have the ReaderUtil method directly built in (possibly as a "view" > in the util.Collection sense). 
In general composite readers do not really > need to look like the previous IndexReaders, they could simply be a > "collection" of SegmentReaders with some functionality like reopen. > On the other side, atomic readers do not need reopen logic anymore? When a > segment changes, you need a new atomic reader? - maybe because of deletions > thats not the best idea, but we should investigate. Maybe make the whole > reopen logic simplier to use (ast least on the collection reader level). > We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2474) Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)
[ https://issues.apache.org/jira/browse/LUCENE-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979888#action_12979888 ] Earwin Burrfoot commented on LUCENE-2474: - bq. Earwin's working on improving this, I think, under LUCENE-2355 I stalled, and then there were just so many changes under trunk, so I have to restart now :) Thanks for another kick. > Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean > custom caches that use the IndexReader (getFieldCacheKey) > > > Key: LUCENE-2474 > URL: https://issues.apache.org/jira/browse/LUCENE-2474 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shay Banon > Attachments: LUCENE-2474.patch, LUCENE-2474.patch > > > Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean > custom caches that use the IndexReader (getFieldCacheKey). > A spin of: https://issues.apache.org/jira/browse/LUCENE-2468. Basically, its > make a lot of sense to cache things based on IndexReader#getFieldCacheKey, > even Lucene itself uses it, for example, with the CachingWrapperFilter. > FieldCache enjoys being called explicitly to purge its cache when possible > (which is tricky to know from the "outside", especially when using NRT - > reader attack of the clones). > The provided patch allows to plug a CacheEvictionListener which will be > called when the cache should be purged for an IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979522#action_12979522 ] Earwin Burrfoot commented on LUCENE-2312: - Some questions to align myself with impending reality. Is it right that future RT readers are no longer immutable snapshots (in the sense that they have a variable maxDoc)? If so, are you keeping the current NRT mode, with fast turnaround, yet immutable readers? > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: Realtime Branch >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: Realtime Branch > > Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979366#action_12979366 ] Earwin Burrfoot commented on LUCENE-2843: - bq. Nope, havent looked at their code... i think i stopped at the documentation when i saw how they analyzed text! All my points are contained within their documentation. No need to look at the code (it's as shady as Lucene's). In the same manner, Lucene had crappy analysis for years, until you took hold of the (unicode) police baton. So let's not allow color differences between our analyzers to affect our judgement on other parts of ours : ) bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows machine with whatever we default to. I'm questioning whether there is any legitimate, adequate reason to have that many terms. I'm agreeing on the mmap+32bit/mmap+windows point for a reasonable amount of terms though :/ A hybrid solution, with the term-dict being loaded completely into memory (either via mmap, or into arrays) on a per-field basis, is probably best in the end, however sad it may be. > Add variable-gap terms index impl. > -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. 
> This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. > I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979346#action_12979346 ] Earwin Burrfoot commented on LUCENE-2843: - bq. I don't like the reasoning that, just because sphinx does it and their 'users manage', that makes it ok. I'm in no way advocating it as an all-round better solution. It has its wrinkles, just as anything else does. My reasoning is merely that an alternative exists, and it is viable. As proven by pretty high-profile users. They have a memory-resident term dictionary, and it works; I have heard no complaints regarding this, ever. bq. sphinx also requires mysql Have you read anything at all? It has an integration ready, for the layman user who just wants to stick a fulltext search into their little app, but it is in no way reliant on it. Sphinx is a direct alternative to Solr. {quote} But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good decisions on what gets swapped out is risky - Lucene is better informed than the OS on which data structures are worth spending RAM on (norms, terms index, field cache, del docs). If indeed the terms dict (thanks to FSTs) becomes small enough to "fit" in RAM, then we should load it into RAM (and do away w/ the terms index). {quote} That's a bit delusional. If a system is forced to swap out, it'll swap your explicitly managed RAM just as readily as memory-mapped files. I've seen this countless times. But then, you have a number of benefits - like sharing filesystem cache when opening the same file multiple times, offloading things from the Java heap (which is almost always a good thing), and the fastest load-into-memory times possible. Sorry if I sound offensive at times, but, damn, there's a whole world of simple and efficient code lying ahead in that direction :) > Add variable-gap terms index impl. 
> -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. > This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. > I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979306#action_12979306 ] Earwin Burrfoot commented on LUCENE-2840: - A lot of fork-join type frameworks don't even care. Even though scheduling threads is something people supposedly use them for. Why? I guess that's due to low yield/cost ratio. You frequently quote "progress, not perfection" in relation to the code, but why don't we apply this same principle to our threading guarantees? I don't want to use allowed concurrency fully. That's not realistic. I want 85% of it. That's already a huge leap ahead of single-threaded searches. > Multi-Threading in IndexSearcher (after removal of MultiSearcher and > ParallelMultiSearcher) > --- > > Key: LUCENE-2840 > URL: https://issues.apache.org/jira/browse/LUCENE-2840 > Project: Lucene - Java > Issue Type: Sub-task > Components: Search >Reporter: Uwe Schindler >Priority: Minor > Fix For: 4.0 > > > Spin-off from parent issue: > {quote} > We should discuss about how many threads should be spawned. If you have an > index with many segments, even small ones, I think only the larger segments > should be separate threads, all others should be handled sequentially. So > maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then > only spawn maxThreads-1 threads for the bigger readers and then one > additional thread for the rest? > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979305#action_12979305 ] Earwin Burrfoot commented on LUCENE-2843: - As I said, there's already a search server with strictly in-memory (in mmap sense. it can theoretically be paged out) terms dict AND widespread adoption. Their users somehow manage. My guess is that's because people with "insane number of terms" store various crap like unique timestamps as terms. With CSF ("attributes" in Sphinx lingo), and some nice filters that can work over CSF, there's no longer any need to stuff your timestamps in the same place you stuff your texts. That can be reflected in documentation, and then, suddenly, we can drop "on-disk only" support. > Add variable-gap terms index impl. > -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. > This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. 
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979277#action_12979277 ] Earwin Burrfoot commented on LUCENE-2843: - And we're nearing a day when we keep the whole term dictionary in memory (as Sphinx does for instance). At that point a gazillion of term lookup-related hacks (like lookup cache) become obsolete :) Term dictionary itself can also be memory-mapped after this, instead of being "read" and "built" from disk, which makes new segment opening near-instantaneous. > Add variable-gap terms index impl. > -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. > This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. > I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. 
Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
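The memory-mapping idea in the comment above (mapping the term dictionary file instead of reading and rebuilding it on open) can be sketched with plain NIO. This is a hypothetical helper for illustration, not Lucene code:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the mmap idea: mapping a (term dictionary) file exposes its
// bytes directly, and the OS pages them in lazily on first access -- so
// "opening" costs almost nothing regardless of file size.
class MappedTermDict {
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

The trade-off discussed in the thread applies: the mapped pages live in the filesystem cache (shared across openers, off the Java heap), but the OS, not Lucene, decides when they are evicted.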
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979276#action_12979276 ] Earwin Burrfoot commented on LUCENE-2840: - bq. But doesn't that mean that an app w/ rare queries but each query is massive fails to use all available concurrency? Yes. But that's not my case. And likely not anyone else's. I think if you want to be super-generic, it's better to defer exact threading to the user, instead of doing a one-size-fits-all solution. Else you risk conjuring another ConcurrentMergeScheduler. While we're at it, we can throw in some sample implementation, which can satisfy some of the users, but not everyone. > Multi-Threading in IndexSearcher (after removal of MultiSearcher and > ParallelMultiSearcher) > --- > > Key: LUCENE-2840 > URL: https://issues.apache.org/jira/browse/LUCENE-2840 > Project: Lucene - Java > Issue Type: Sub-task > Components: Search >Reporter: Uwe Schindler >Priority: Minor > Fix For: 4.0 > > > Spin-off from parent issue: > {quote} > We should discuss about how many threads should be spawned. If you have an > index with many segments, even small ones, I think only the larger segments > should be separate threads, all others should be handled sequentially. So > maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then > only spawn maxThreads-1 threads for the bigger readers and then one > additional thread for the rest? > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976027#action_12976027 ] Earwin Burrfoot commented on LUCENE-2840: - I use the following scheme: * There is a fixed pool of threads shared by all searches, that limits total concurrency. * Each new search apprehends at most a fixed number of threads from this pool (say, 2-3 of 8 in my setup), * and these threads churn through segments as through a queue (in maxDoc order, but I think even that is unnecessary). No special smart binding between threads and segments (eg. 1 thread for each biggie, 1 thread for all of the small ones) - means simpler code, and zero possibility of stalling, when there are threads to run, segments to search, but binding policy does not connect them. Using fewer threads per-search than total available is a precaution against biggie searches blocking fast ones. > Multi-Threading in IndexSearcher (after removal of MultiSearcher and > ParallelMultiSearcher) > --- > > Key: LUCENE-2840 > URL: https://issues.apache.org/jira/browse/LUCENE-2840 > Project: Lucene - Java > Issue Type: Sub-task > Components: Search >Reporter: Uwe Schindler >Priority: Minor > Fix For: 4.0 > > > Spin-off from parent issue: > {quote} > We should discuss about how many threads should be spawned. If you have an > index with many segments, even small ones, I think only the larger segments > should be separate threads, all others should be handled sequentially. So > maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then > only spawn maxThreads-1 threads for the bigger readers and then one > additional thread for the rest? > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
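A minimal sketch of the scheme described in that comment, with hypothetical names and the per-segment search replaced by a stub:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One fixed pool shared by all searches caps total concurrency; each
// individual search grabs only a few workers, and those workers drain the
// segment list as a queue -- so no thread<->segment binding policy can
// ever stall idle threads while segments are still waiting.
class PooledSearcher {
    private final ExecutorService pool;   // shared by all concurrent searches
    private final int threadsPerSearch;   // e.g. 3 out of a pool of 8

    PooledSearcher(int poolSize, int threadsPerSearch) {
        this.pool = Executors.newFixedThreadPool(poolSize);
        this.threadsPerSearch = threadsPerSearch;
    }

    List<String> search(List<String> segments) throws InterruptedException {
        Queue<String> todo = new ConcurrentLinkedQueue<>(segments);
        Queue<String> results = new ConcurrentLinkedQueue<>();
        CountDownLatch done = new CountDownLatch(threadsPerSearch);
        for (int i = 0; i < threadsPerSearch; i++) {
            pool.submit(() -> {
                String segment;
                while ((segment = todo.poll()) != null) {
                    results.add("hits(" + segment + ")");  // stand-in for real per-segment search
                }
                done.countDown();
            });
        }
        done.await();
        return new ArrayList<>(results);
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Capping each search below the pool size is what keeps a single huge search from monopolizing all workers and blocking the fast queries behind it.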
[jira] Commented: (LUCENE-2825) FSDirectory.open should return MMap on 64-bit Solaris
[ https://issues.apache.org/jira/browse/LUCENE-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975212#action_12975212 ] Earwin Burrfoot commented on LUCENE-2825: - {quote} bq. CPU cache / TLB effects? Using more address space isn't completely free. In the case of Solaris I think there is much less of a chance of TLB effects? For example on Solaris java automatically uses large pages (unlike Linux, Windows etc). {quote} For, like, 13Gb of memory-mapped index, I've seen no noticeable difference between having large pages on and off under Linux. That's some anecdotal evidence, as I haven't done any extensive research, but still. > FSDirectory.open should return MMap on 64-bit Solaris > - > > Key: LUCENE-2825 > URL: https://issues.apache.org/jira/browse/LUCENE-2825 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Robert Muir > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2825.patch > > > MMap is ~ 30% faster than NIOFS on this platform. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery "pk lookup" performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974350#action_12974350 ] Earwin Burrfoot commented on LUCENE-2829: - Nobody halts your progress, we're merely discussing. I, on the other hand, have a feeling that Lucene is overflowing with "single incremental improvements" aka "hacks", as they are easier and faster to implement than trying to get a bigger picture, and, yes, rebuilding everything :) For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Whether we implement bloom filters, or just guarantee to keep the whole term dict in memory with reasonable lookup routine (eg. as FST). Having said that, I reiterate, I'm not here to stop you or turn this issue into something else. > improve termquery "pk lookup" performance > - > > Key: LUCENE-2829 > URL: https://issues.apache.org/jira/browse/LUCENE-2829 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Robert Muir > Attachments: LUCENE-2829.patch > > > For things that are like primary keys and don't exist in some segments (worst > case is primary/unique key that only exists in 1) > we do wasted seeks. > While LUCENE-2694 tries to solve some of this issue with TermState, I'm > concerned we could every backport that to 3.1 for example. > This is a simpler solution here just to solve this one problem in > termquery... we could just revert it in trunk when we resolve LUCENE-2694, > but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery "pk lookup" performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974274#action_12974274 ] Earwin Burrfoot commented on LUCENE-2829: - Term lookup misses can be alleviated by a simple Bloom Filter. No caching misses required, helps both PK and near-PK queries. > improve termquery "pk lookup" performance > - > > Key: LUCENE-2829 > URL: https://issues.apache.org/jira/browse/LUCENE-2829 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Robert Muir > Attachments: LUCENE-2829.patch > > > For things that are like primary keys and don't exist in some segments (worst > case is primary/unique key that only exists in 1) > we do wasted seeks. > While LUCENE-2694 tries to solve some of this issue with TermState, I'm > concerned we could every backport that to 3.1 for example. > This is a simpler solution here just to solve this one problem in > termquery... we could just revert it in trunk when we resolve LUCENE-2694, > but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
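A toy version of the Bloom-filter idea from the comment above (two hash functions, hypothetical class; a real implementation would size the bit set from the expected term count and the target false-positive rate):

```java
import java.util.BitSet;

// Sketch of the suggestion: a per-segment Bloom filter answers "definitely
// absent" / "maybe present" for a term, letting a primary-key TermQuery
// skip the terms-dict seek on segments that cannot contain the term.
class TermBloomFilter {
    private final BitSet bits;
    private final int size;

    TermBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(String term) {
        return Math.floorMod(term.hashCode(), size);
    }

    private int h2(String term) {
        return Math.floorMod(31 * term.hashCode() + 17, size);
    }

    void add(String term) {
        bits.set(h1(term));
        bits.set(h2(term));
    }

    // false => term is certainly not in this segment: skip the disk seek.
    // true  => term may be present: fall through to the real lookup.
    boolean mightContain(String term) {
        return bits.get(h1(term)) && bits.get(h2(term));
    }
}
```

The asymmetry is what makes this safe for the PK case: a negative answer is always correct, so no caching of misses is needed, while a (rare) false positive merely costs one ordinary lookup.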
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Synced to trunk. bq. Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort? Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone. Turns out checkpoint() is also no longer needed. I have no other lingering cleanup urges, this is ready to be committed. I think. > stop writing shared doc stores across segments > -- > > Key: LUCENE-2814 > URL: https://issues.apache.org/jira/browse/LUCENE-2814 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 3.1, 4.0 >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, > LUCENE-2814.patch, LUCENE-2814.patch > > > Shared doc stores enables the files for stored fields and term vectors to be > shared across multiple segments. We've had this optimization since 2.1 I > think. > It works best against a new index, where you open an IW, add lots of docs, > and then close it. In that case all of the written segments will reference > slices a single shared doc store segment. > This was a good optimization because it means we never need to merge these > files. But, when you open another IW on that index, it writes a new set of > doc stores, and then whenever merges take place across doc stores, they must > now be merged. > However, since we switched to shared doc stores, there have been two > optimizations for merging the stores. First, we now bulk-copy the bytes in > these files if the field name/number assignment is "congruent". Second, we > now force congruent field name/number mapping in IndexWriter. 
This means > this optimization is much less potent than it used to be. > Furthermore, the optimization adds *a lot* of hair to > IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over > time, and causes odd behavior like a merge possibly forcing a flush when it > starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent > flushing, we can no longer share doc stores. > So, I think we should turn off the write-side of shared doc stores to pave > the path for DWPT to land on trunk and simplify IW/DW. We still must support > reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2818: Priority: Minor (was: Major) This change is really minor but, I think, convenient. You don't have to lug a reference to the Directory along and recalculate the file name if the only thing you want to say is that the write was a failure and you no longer need this file. > abort() method for IndexOutput > -- > > Key: LUCENE-2818 > URL: https://issues.apache.org/jira/browse/LUCENE-2818 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Earwin Burrfoot >Priority: Minor > > I'd like to see an abort() method on IndexOutput that silently (no exceptions) > closes the IO and then does a silent papaDir.deleteFile(this.fileName()). > This will simplify a bunch of error recovery code for IndexWriter and > friends, but constitutes an API backcompat break. > What do you think?
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972765#action_12972765 ] Earwin Burrfoot commented on LUCENE-2818: - bq. I think we can make a default impl that simply closes & suppresses exceptions? (We can't .deleteFile since an abstract IO doesn't know its Dir). Our concrete impls can override w/ versions that do delete the file... I don't think we need a default impl? For some directory impls close() is a no-op, and, what is more important, an abstract method forces you to implement it; you can't forget this, so we're not gonna see broken directories that don't do abort() properly.
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972764#action_12972764 ] Earwin Burrfoot commented on LUCENE-2818: - bq. Can abort() have a default impl in IndexOutput, such as close() followed by deleteFile() maybe? If so, then it won't break anything. It can't. To call deleteFile you need both a reference to the papa-Directory and the name of the file this IO writes to. The abstract IO class has neither. If we add them, they have to be passed to a new constructor, and that's an API break ;) bq. Would abort() on Directory fit better? E.g., it can abort all currently open and modified files, instead of the caller calling abort() on each IndexOutput? Are you thinking of a case where a write failed, and the caller would call abort() immediately, instead of some higher-level code? If so, would rollback() be a better name? Oh, no, no. No way. I don't want to push someone else's responsibility onto Directory. This abort() is merely a shortcut. Let's go with a usage example: Here's FieldsWriter.java with LUCENE-2814 applied (skipping irrelevant parts) - https://gist.github.com/746358 Now, the same, with abort() - https://gist.github.com/746367
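The trade-off discussed in this thread (the abstract IndexOutput knows neither its parent Directory nor its file name, so only concrete implementations can delete anything) can be sketched as follows. This is an illustrative mock of the proposal, not Lucene's actual IndexOutput API; all class and method names below are hypothetical, with java.nio.file.Path standing in for the Directory/fileName pair.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed API. The abstract base cannot offer a
// default abort() that deletes the file: it holds neither a Directory
// reference nor the file name, exactly as noted in the comment above.
abstract class SketchIndexOutput {
    abstract void writeByte(byte b) throws IOException;
    abstract void close() throws IOException;
    // Abstract on purpose: every concrete directory impl is forced to
    // implement it, so none can "forget" proper abort behavior.
    abstract void abort();
}

// A file-backed impl that does know where it writes, so abort() can
// silently close and then delete the partially written file.
class PathIndexOutput extends SketchIndexOutput {
    private final Path path;
    private final OutputStream out;

    PathIndexOutput(Path path) throws IOException {
        this.path = path;
        this.out = Files.newOutputStream(path);
    }

    @Override void writeByte(byte b) throws IOException { out.write(b); }
    @Override void close() throws IOException { out.close(); }

    // The shortcut proposed in this issue: swallow exceptions on close,
    // then delete the file, simplifying callers' error-recovery code.
    @Override void abort() {
        try { out.close(); } catch (IOException ignored) { }
        try { Files.deleteIfExists(path); } catch (IOException ignored) { }
    }
}
```

With something like this, the catch blocks in the linked FieldsWriter example collapse to a single abort() call per output instead of a close-suppress-then-delete-by-name dance.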
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter. Closed files list was unused, and is silently gone. Open files list was used to: * prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter. ** no shared docstores, no need + IFD no longer requires a reference to DW * delete already opened docstore files, when aborting. ** index chain now handles this on its own + has cleaner error handling code.
[jira] Created: (LUCENE-2818) abort() method for IndexOutput
abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see an abort() method on IndexOutput that silently (no exceptions) closes the IO and then does a silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think?
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass. SegmentWriteState vs DocumentsWriter bothers me: we track flushed files in both, and we inconsistently get the current segment from both of them.
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972316#action_12972316 ] Earwin Burrfoot commented on LUCENE-2814: - Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not. Do you do IRC? It can be faster to discuss in realtime, and you could also tell what help you need with the branch.
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972298#action_12972298 ] Earwin Burrfoot commented on LUCENE-2814: - So, what's the plan?
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972259#action_12972259 ] Earwin Burrfoot commented on LUCENE-2814: - Ugh. On first glance @ realtime branch, my patch is not a strict subset. But most stuff, including things I'm struggling with now, is definitely in. So, I guess, I'll stop wasting effort. What is the state of the realtime branch? Is it stable? How soon is "sometime soon"? :)
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch First iteration. Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together? Lots of non-critical nocommits (just markers for places I'd like to recheck). DW.docStoreEnabled and *.closeDocStore() have to go before committing.
[jira] Commented: (LUCENE-2811) SegmentInfo should explicitly track whether that segment wrote term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971510#action_12971510 ] Earwin Burrfoot commented on LUCENE-2811: - From IRC: SegmentMerger.hasVectors carries no new information compared to OneMerge.hasVectors, and can be dropped. OneMerge.hasVectors is initialized just near OneMerge.info, and is later used to set OneMerge.info.hasVectors, so we might as well do that from the get-go and drop OM.hV. > SegmentInfo should explicitly track whether that segment wrote term vectors > --- > > Key: LUCENE-2811 > URL: https://issues.apache.org/jira/browse/LUCENE-2811 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2811.patch > > > Today SegmentInfo doesn't know if it has vectors, which means its files() > method must check if the files exist. > This leads to subtle bugs, because Si.files() caches the files but then we > fail to invalidate that later when the term vectors files are created. > It also leads to sloppy code, eg TermVectorsReader "gracefully" handles being > opened when the files do not exist. I don't like that; it should only be > opened if they exist. 
> This also fixes these intermittent failures we've been seeing: > {noformat} > junit.framework.AssertionFailedError: IndexFileDeleter doesn't know about > file _1e.tvx >at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:979) >at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:917) >at > org.apache.lucene.index.IndexWriter.filesExist(IndexWriter.java:3633) >at > org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:3699) >at > org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2407) >at > org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2478) >at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2460) >at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2444) >at > org.apache.lucene.index.TestIndexWriterExceptions.testRandomExceptionsThreads(TestIndexWriterExceptions.java:213) > {noformat}
[jira] Commented: (LUCENE-2811) SegmentInfo should explicitly track whether that segment wrote term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971303#action_12971303 ] Earwin Burrfoot commented on LUCENE-2811: - I think SegmentInfo.hasVectors should be a boolean. If this is an old index, we can check for the file's presence in the SegmentInfo constructor, set the flag properly, and on the next write the index is silently upgraded.
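The silent-upgrade idea from this comment (old segments never recorded the flag, so derive it once from file presence and persist it on the next write) could look roughly like the sketch below. The class and field names are illustrative, not Lucene's actual SegmentInfo code; only the ".tvx" term-vectors extension is taken from the issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: a new-format segment reads hasVectors directly from
// an explicitly stored flag, while an old-format segment falls back to
// probing for its ".tvx" term-vectors file. Hypothetical names throughout.
class SegmentInfoSketch {
    final String name;
    final boolean hasVectors;

    // New format: the flag was written explicitly at flush time, so
    // files() never needs to stat the directory.
    SegmentInfoSketch(String name, boolean hasVectors) {
        this.name = name;
        this.hasVectors = hasVectors;
    }

    // Old format: derive the flag from file presence once, in the
    // constructor; the next commit writes it out, silently upgrading.
    SegmentInfoSketch(String name, Path indexDir) {
        this.name = name;
        this.hasVectors = Files.exists(indexDir.resolve(name + ".tvx"));
    }
}
```

Computing the flag once at construction also sidesteps the caching bug described in the issue, where files() cached a file list that was never invalidated when the vectors files appeared later.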
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971248#action_12971248 ] Earwin Burrfoot commented on LUCENE-2814: - bq. We should verify the back-compat test has at least one index w/ shared doc stores. I believe I've seen some back-compat failures with my quick'n'dirty patch that removed both reading and writing shared docstores. So it should be ok.
[jira] Issue Comment Edited: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971058#action_12971058 ] Earwin Burrfoot edited comment on LUCENE-2611 at 12/13/10 5:36 PM: --- {quote} bq. I wonder if several .iml files can be in the same directory but their so-called "content-roots" would be set to where they are now? I'm pretty sure IntelliJ allows only one .iml per directory. {quote} My IDEA project for Lucene has all .iml files in root dir. Various content-roots for sure. I recall that worked for at least several major IDEA versions. > IntelliJ IDEA setup > --- > > Key: LUCENE-2611 > URL: https://issues.apache.org/jira/browse/LUCENE-2611 > Project: Lucene - Java > Issue Type: New Feature > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, > LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, > LUCENE-2611_test_2.patch > > > Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. > The attached patch adds a new top level directory {{dev-tools/}} with sub-dir > {{idea/}} containing basic setup files for trunk, as well as a top-level ant > target named "idea" that copies these files into the proper locations. This > arrangement avoids the messiness attendant to in-place project configuration > files directly checked into source control. > The IDEA configuration includes modules for Lucene and Solr, each Lucene and > Solr contrib, and each analysis module. A JUnit test run per module is > included. > Once {{ant idea}} has been run, the only configuration that must be performed > manually is configuring the project-level JDK. > If this patch is committed, Subversion svn:ignore properties should be > added/modified to ignore the destination module files (*.iml) in each > module's directory. > Iam Jambour has written up on the Lucene wiki a detailed set of instructions > for applying the 3.X branch patch: > http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ
[jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971058#action_12971058 ] Earwin Burrfoot commented on LUCENE-2611: - [quote] bq. I wonder if several .iml files can be in the same directory but their so-called "content-roots" would be set to where they are now? I'm pretty sure IntelliJ allows only one .iml per directory. [quote] My IDEA project for Lucene has all .iml files in root dir. Various content-roots for sure. I recall that worked for at least several major IDEA versions. > IntelliJ IDEA setup > --- > > Key: LUCENE-2611 > URL: https://issues.apache.org/jira/browse/LUCENE-2611 > Project: Lucene - Java > Issue Type: New Feature > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, > LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, > LUCENE-2611_test_2.patch > > > Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. > The attached patch adds a new top level directory {{dev-tools/}} with sub-dir > {{idea/}} containing basic setup files for trunk, as well as a top-level ant > target named "idea" that copies these files into the proper locations. This > arrangement avoids the messiness attendant to in-place project configuration > files directly checked into source control. > The IDEA configuration includes modules for Lucene and Solr, each Lucene and > Solr contrib, and each analysis module. A JUnit test run per module is > included. > Once {{ant idea}} has been run, the only configuration that must be performed > manually is configuring the project-level JDK. 
> If this patch is committed, Subversion svn:ignore properties should be > added/modified to ignore the destination module files (*.iml) in each > module's directory. > Iam Jambour has written up on the Lucene wiki a detailed set of instructions > for applying the 3.X branch patch: > http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971057#action_12971057 ] Earwin Burrfoot commented on LUCENE-2814: - I'll take this. I think. > stop writing shared doc stores across segments > -- > > Key: LUCENE-2814 > URL: https://issues.apache.org/jira/browse/LUCENE-2814 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 3.1, 4.0 >Reporter: Michael McCandless >Assignee: Michael McCandless > > Shared doc stores enables the files for stored fields and term vectors to be > shared across multiple segments. We've had this optimization since 2.1 I > think. > It works best against a new index, where you open an IW, add lots of docs, > and then close it. In that case all of the written segments will reference > slices a single shared doc store segment. > This was a good optimization because it means we never need to merge these > files. But, when you open another IW on that index, it writes a new set of > doc stores, and then whenever merges take place across doc stores, they must > now be merged. > However, since we switched to shared doc stores, there have been two > optimizations for merging the stores. First, we now bulk-copy the bytes in > these files if the field name/number assignment is "congruent". Second, we > now force congruent field name/number mapping in IndexWriter. This means > this optimization is much less potent than it used to be. > Furthermore, the optimization adds *a lot* of hair to > IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over > time, and causes odd behavior like a merge possibly forcing a flush when it > starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent > flushing, we can no longer share doc stores. > So, I think we should turn off the write-side of shared doc stores to pave > the path for DWPT to land on trunk and simplify IW/DW. 
We still must support > reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2802) DirectoryReader ignores NRT SegmentInfos in #isOptimized()
[ https://issues.apache.org/jira/browse/LUCENE-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968503#action_12968503 ] Earwin Burrfoot commented on LUCENE-2802: - Patch looks cool. > DirectoryReader ignores NRT SegmentInfos in #isOptimized() > -- > > Key: LUCENE-2802 > URL: https://issues.apache.org/jira/browse/LUCENE-2802 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 3.1, 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer > Attachments: LUCENE-2802.patch, LUCENE-2802.patch > > > DirectoryReader only takes the shared (with IW) SegmentInfos into account in > DirectoryReader#isOptimized(). This can return true even if the actual > realtime reader sees more than one segment. > {code} > public boolean isOptimized() { > ensureOpen(); > // if segmentInfos changes in IW this can return a false positive > return segmentInfos.size() == 1 && !hasDeletions(); > } > {code} > DirectoryReader should check if this reader has a non-null segmentInfosStart > and use that instead -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2802) DirectoryReader ignores NRT SegmentInfos in #isOptimized()
[ https://issues.apache.org/jira/browse/LUCENE-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967238#action_12967238 ] Earwin Burrfoot commented on LUCENE-2802: - Heh, I've [mentioned|https://issues.apache.org/jira/browse/LUCENE-2355?focusedCommentId=12874857&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12874857] this earlier in LUCENE-2355. :) > DirectoryReader ignores NRT SegmentInfos in #isOptimized() > -- > > Key: LUCENE-2802 > URL: https://issues.apache.org/jira/browse/LUCENE-2802 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 4.0 >Reporter: Simon Willnauer > Attachments: LUCENE-2802.patch > > > DirectoryReader only takes the shared (with IW) SegmentInfos into account in > DirectoryReader#isOptimized(). This can return true even if the actual > realtime reader sees more than one segment. > {code} > public boolean isOptimized() { > ensureOpen(); > // if segmentInfos changes in IW this can return a false positive > return segmentInfos.size() == 1 && !hasDeletions(); > } > {code} > DirectoryReader should check if this reader has a non-null segmentInfosStart > and use that instead -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
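The fix the issue asks for boils down to deciding against a point-in-time view of the segments captured at reader-open time, rather than the SegmentInfos instance shared with the writer. A minimal standalone sketch of that idea (class and field names here are illustrative, not Lucene's actual internals):

```java
import java.util.ArrayList;
import java.util.List;

class DirectoryReaderSketch {
    // Snapshot of segment names taken when the reader was opened.
    private final List<String> segmentSnapshot;
    private final boolean hasDeletions;

    DirectoryReaderSketch(List<String> liveSegments, boolean hasDeletions) {
        // Copy the list, so later changes made by the writer are
        // invisible to this reader.
        this.segmentSnapshot = new ArrayList<>(liveSegments);
        this.hasDeletions = hasDeletions;
    }

    boolean isOptimized() {
        // Decide against the snapshot, not the shared, mutable infos.
        return segmentSnapshot.size() == 1 && !hasDeletions;
    }
}
```

The original bug is exactly the absence of this copy: consulting the live, shared structure makes the answer race with concurrent flushes in IndexWriter.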
[jira] Commented: (LUCENE-2799) MMapDirectory not designed for inheritance
[ https://issues.apache.org/jira/browse/LUCENE-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966865#action_12966865 ] Earwin Burrfoot commented on LUCENE-2799: - I think in such cases it's always best to copy-paste stuff into your project and adapt it. It's simple, it doesn't break when you upgrade, it works with clumsily-built code. Designing actual classes (not API interfaces, but full-blown working code) for inheritance is _very_ complex. I've seen few people trying this for real, and even fewer succeeding. > MMapDirectory not designed for inheritance > -- > > Key: LUCENE-2799 > URL: https://issues.apache.org/jira/browse/LUCENE-2799 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Affects Versions: 3.0.3 >Reporter: René Treffer > > How to reproduce: > Try to inherit from MMapDirectory to change the openInput logic (open files > from different directories). > Expected result: > Inherit from MMapDirectory, override the one method, done. > Actual result: > It's impossible to override the method as the inner classes would be > missing. It's impossible to fork the inner classes as they depend on a final > method with default visibility (cleanMapping). > It turns out to be the easiest option to completely fork the code and replace > just the method in question. > Possible fix: > Change the visibility of most members and subtypes to be at least protected > and avoid the default visibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
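For contrast, here is a sketch of the subclassing René wanted (resolving files from multiple directories by overriding one method). This is not Lucene's real Directory API, just a hypothetical class deliberately designed for inheritance: the overridable hook is protected and non-final, which is exactly what MMapDirectory's package-private final internals prevent.

```java
import java.nio.file.Path;
import java.util.List;

class SimpleDirectory {
    protected final Path root;

    SimpleDirectory(Path root) { this.root = root; }

    // Designed-for-inheritance variant: protected, non-final, overridable.
    protected Path resolve(String name) { return root.resolve(name); }
}

class MultiRootDirectory extends SimpleDirectory {
    private final List<Path> extraRoots;

    MultiRootDirectory(Path root, List<Path> extraRoots) {
        super(root);
        this.extraRoots = extraRoots;
    }

    @Override
    protected Path resolve(String name) {
        // Look in the extra roots first, fall back to the primary root.
        for (Path p : extraRoots) {
            if (p.resolve(name).toFile().exists()) return p.resolve(name);
        }
        return super.resolve(name);
    }
}
```

Keeping such hooks working across versions is the hard part Earwin is pointing at, which is why copy-and-adapt is often the pragmatic choice.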
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966550#action_12966550 ] Earwin Burrfoot commented on LUCENE-2790: - Ok, let's commit? There's no need to force the first few commits to CFS. CFS's sole purpose is to keep the number of simultaneously open files low. You're not likely to see frightening numbers with only a pair of segments in the index. Later these segments are merged (and probably CFSed), so no worries. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2471) Supporting bulk copies in Directory
[ https://issues.apache.org/jira/browse/LUCENE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966358#action_12966358 ] Earwin Burrfoot commented on LUCENE-2471: - Hmmm. Are we going to do this? Optimized bulk copies IndexInput -> IndexOutput for merges. I currently see II.copyBytes(IndexOutput out, long numBytes) method in trunk, but it's a little bit of a mess (II.copyBytes calls IO.copyBytes, strange overrides doing the same thing in various ways), no optimizations for FSDirectory (or at least NIOFSdirectory) case, no offset parameter? not used when bulk-merging? (well, DataOutput.copyBytes _is_ used, but there's a single inefficient version of it) > Supporting bulk copies in Directory > --- > > Key: LUCENE-2471 > URL: https://issues.apache.org/jira/browse/LUCENE-2471 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Earwin Burrfoot > Fix For: 3.1, 4.0 > > > A method can be added to IndexOutput that accepts IndexInput, and writes > bytes using it as a source. > This should be used for bulk-merge cases (offhand - norms, docstores?). Some > Directories can then override default impl and skip intermediate buffers > (NIO, MMap, RAM?). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
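The "single inefficient version" Earwin mentions is essentially a loop through an intermediate staging buffer. A minimal sketch of that default bulk copy, using plain java.io streams as stand-ins for Lucene's IndexInput/IndexOutput (optimized Directory implementations could override this to skip the staging buffer entirely):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BulkCopy {
    private static final int BUFFER_SIZE = 1024;

    // Copy exactly numBytes from in to out through a reusable buffer.
    static void copyBytes(InputStream in, OutputStream out, long numBytes) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        while (numBytes > 0) {
            int toRead = (int) Math.min(buffer.length, numBytes);
            int read = in.read(buffer, 0, toRead);
            if (read < 0) throw new IOException("unexpected EOF while bulk-copying");
            out.write(buffer, 0, read);
            numBytes -= read;
        }
    }
}
```

The point of the issue is that an NIO- or mmap-backed Directory can implement the same contract with a channel-to-channel transfer or a direct memory copy, avoiding the extra pass through `buffer`.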
[jira] Closed: (LUCENE-2307) Spurious exception in TestIndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot closed LUCENE-2307. --- Resolution: Cannot Reproduce Never reproduced since, closing as stale and outdated > Spurious exception in TestIndexWriter > - > > Key: LUCENE-2307 > URL: https://issues.apache.org/jira/browse/LUCENE-2307 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 3.0.2 > Environment: MacOS X, Java 6 >Reporter: Earwin Burrfoot > > Happened on trunk: > [junit] Testsuite: org.apache.lucene.index.TestIndexWriter > [junit] Tests run: 106, Failures: 1, Errors: 0, Time elapsed: 18.567 sec > [junit] > [junit] - Standard Output --- > [junit] java.io.IOException: background merge hit exception: _0:c19 > _1:c19 _2:c2 _3:c2 _4:c2 _5:c2 _6:c2 _7:c2 _8:c1 into _9 [optimize] > [mergeDocStores] > [junit] at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2341) > [junit] at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280) > [junit] at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2250) > [junit] at > org.apache.lucene.index.TestIndexWriter$CommitAndAddIndexes.doBody(TestIndexWriter.java:3759) > [junit] at > org.apache.lucene.index.TestIndexWriter$RunAddIndexesThreads$1.run(TestIndexWriter.java:3708) > [junit] Caused by: > org.apache.lucene.index.MergePolicy$MergeAbortedException: merge is aborted: > _0:c19 _1:c19 _2:c2 _3:c2 _4:c2 _5:c2 _6:c2 _7:c2 _8:c1 into _9 [optimize] > [mergeDocStores] > [junit] at > org.apache.lucene.index.MergePolicy$OneMerge.checkAborted(MergePolicy.java:135) > [junit] at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4308) > [junit] at > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3958) > [junit] at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339) > [junit] at > 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407) > [junit] - --- > [junit] Testcase: > testAddIndexesWithClose(org.apache.lucene.index.TestIndexWriter): FAILED > [junit] null > [junit] junit.framework.AssertionFailedError: null > [junit] at > org.apache.lucene.index.TestIndexWriter.testAddIndexesWithClose(TestIndexWriter.java:3835) > [junit] at > org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:252) > [junit] > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966285#action_12966285 ] Earwin Burrfoot commented on LUCENE-2790: - Shai, what about: bq. My only concern is... it looks like addIndexes(IR[]), with compound file used in the end, may fail to delete the non-compound files once the SegmentInfo is committed? I fixed everything else, but can't answer this question. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2790: Attachment: LUCENE-2790.patch Fixed your test failure > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2790: Attachment: LUCENE-2790.patch Okay, this patch fixes the remaining threading issue in IW.mergeMiddle, and three tests that were expecting CFS segments and weren't getting them, because flush now respects noCFSRatio (default 0.1) > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966112#action_12966112 ] Earwin Burrfoot commented on LUCENE-2790: - bq. I checked who implements useCompoundFile and all I find is LogMP and NoMP, both don't iterate on the SegmentInfos. What MP did you test with? Apply my patch, it changes LogMP to use SegmentInfos. bq. So if you can paste here the 'ant test' that reproduces it, it'd be great. ant test -Dtestcase=TestAddIndexes -Dtestmethod=testAddIndexesWithThreads -Dtests.seed=5369960668186287821:331425426639083833 -Dtests.codec=randomPerField The test is threaded, so it doesn't fail always. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2790: Attachment: LUCENE-2790.patch Check this patch out. It moves noCFS ratio to useCompoundFile(SIS, SI) and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966108#action_12966108 ] Earwin Burrfoot edited comment on LUCENE-2790 at 12/2/10 8:12 AM: -- Check this patch out. It changes useCompoundFile(SIS, SI) to respect noCFSRatio and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them. was (Author: earwin): Check this patch out. It moves noCFS ratio to useCompoundFile(SIS, SI) and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
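The noCFSRatio decision the patch consolidates into useCompoundFile(SIS, SI) can be sketched as follows. This is an assumed, simplified shape (plain sizes instead of SegmentInfos), not the exact Lucene code: a segment gets a compound file only when it is a small fraction of the total index, so huge segments skip CFS while many tiny ones still keep the open-file count down.

```java
final class CompoundFilePolicy {
    private final double noCFSRatio; // e.g. the 0.1 default mentioned above

    CompoundFilePolicy(double noCFSRatio) { this.noCFSRatio = noCFSRatio; }

    // All CFS decisions go through this one method, mirroring the patch's
    // goal of deciding in a single place for both flushes and merges.
    boolean useCompoundFile(long[] allSegmentSizes, long candidateSize) {
        long total = 0;
        for (long s : allSegmentSizes) total += s;
        // Empty index: default to compound files.
        if (total <= 0) return true;
        return candidateSize <= noCFSRatio * total;
    }
}
```

Note that in the real IndexWriter this check inspects the live segmentInfos, which is why the comment above insists the caller must hold the lock while the merge policy iterates them.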
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966103#action_12966103 ] Earwin Burrfoot commented on LUCENE-2790: - Fails addIndexesWithThreads with ConcurrentModificationException, if MergePolicy actually tries to iterate infos passed to useCompoundFile(SIS, SI). > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2789) Let codec decide to use compound file system or not
[ https://issues.apache.org/jira/browse/LUCENE-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966071#action_12966071 ] Earwin Burrfoot commented on LUCENE-2789: - I'd like to see a switch like setNeverEverUseCompoundFiles(true) somewhere. We'll still need to copy-paste CFS handling code into each new Codec :/ > Let codec decide to use compound file system or not > --- > > Key: LUCENE-2789 > URL: https://issues.apache.org/jira/browse/LUCENE-2789 > Project: Lucene - Java > Issue Type: Improvement > Components: Codecs, Index >Reporter: Simon Willnauer > > While working on LUCENE-2186 and in the context of recent [mails | > http://www.lucidimagination.com/search/document/e75cfa6050d5176/consolidate_mp_and_lmp#97c69a198952ebaa] > about consolidating MergePolicy and LogMergePolicy I wanna propose a rather > big change to how Compound Files are created / handled in IW. Since Codecs have > been introduced we have several somewhat different ways of how data is written > to the index. Sep codec for instance writes different files for index data > and DocValues will write one file per field and segment. Eventually codecs > need to have more control over how files are written, i.e. whether CFS should be > used or not is IMO really a matter of the codec used for writing. > On the other hand when you look at IW internals CFS really pollutes the > indexing code and relies on information from inside a codec (see > SegmentWriteState.flushedFiles); actually this differentiation spreads across > many classes related to indexing including the LogMergePolicy. IMO how new > flushed segments are written has nothing to do with MP in the first place and > MP currently chooses whether a newly flushed segment is CFS or not (correct me > if I am wrong); pushing all this logic down to codecs would make lots of code > much easier and cleaner. > As mike said this would also reduce the API footprint if we make it private > to the codec. 
I can imagine some situations where you really want control > over certain fields to be stored as non-CFS and other to be stored as CFS. > Codecs might need more information about other segments during a merge to > decide if or not to use CFS based on the segments size but we can easily > change that API. From a reading point of view we already have Codec#files > that can decide case by case what files belong to this codec. > let me know the thoughts -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965380#action_12965380 ] Earwin Burrfoot commented on LUCENE-2779: - bq. So I ended up writing the following code and comment Looks good. > Use ConcurrentHashMap in RAMDirectory > - > > Key: LUCENE-2779 > URL: https://issues.apache.org/jira/browse/LUCENE-2779 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2779-backwardsfix.patch, LUCENE-2779.patch, > LUCENE-2779.patch, LUCENE-2779.patch, TestCHM.java > > > RAMDirectory synchronizes on its instance in many places to protect access to > map of RAMFiles, in addition to updating the sizeInBytes member. In many > places the sync is done for 'read' purposes, while only in few places we need > 'write' access. This looks like a perfect use case for ConcurrentHashMap > Also, syncing around sizeInBytes is unnecessary IMO, since it's an AtomicLong > ... > I'll post a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965296#action_12965296 ] Earwin Burrfoot commented on LUCENE-2779: - Quoting Sun JDK 1.6: {code} public ArrayList(Collection<? extends E> c) { elementData = c.toArray(); size = elementData.length; // c.toArray might (incorrectly) not return Object[] (see 6260652) if (elementData.getClass() != Object[].class) elementData = Arrays.copyOf(elementData, size, Object[].class); } {code} It calls toArray() on the provided collection. You might as well skip wrapping with ArrayList and use toArray directly :D > Use ConcurrentHashMap in RAMDirectory > - > > Key: LUCENE-2779 > URL: https://issues.apache.org/jira/browse/LUCENE-2779 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2779-backwardsfix.patch, LUCENE-2779.patch, > LUCENE-2779.patch, LUCENE-2779.patch, TestCHM.java > > > RAMDirectory synchronizes on its instance in many places to protect access to > map of RAMFiles, in addition to updating the sizeInBytes member. In many > places the sync is done for 'read' purposes, while only in few places we need > 'write' access. This looks like a perfect use case for ConcurrentHashMap > Also, syncing around sizeInBytes is unnecessary IMO, since it's an AtomicLong > ... > I'll post a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
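Following Earwin's point: wrapping keySet() in an ArrayList just to get an array does the toArray() work twice. A hedged sketch of the direct approach for a listAll-style method (plain ConcurrentHashMap, not the real RAMDirectory code):

```java
import java.util.concurrent.ConcurrentHashMap;

class ListAllSketch {
    static String[] listAll(ConcurrentHashMap<String, byte[]> files) {
        // toArray(new String[0]) guarantees a String[] result, sidestepping
        // the Object[] pitfall the quoted ArrayList constructor works around.
        return files.keySet().toArray(new String[0]);
    }
}
```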
[jira] Commented: (LUCENE-2785) TopFieldCollector throws AIOOBE if numHits is 0
[ https://issues.apache.org/jira/browse/LUCENE-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965202#action_12965202 ] Earwin Burrfoot commented on LUCENE-2785: - "A Collector, that counts" - priceless. And, yes, I have one of these too. > TopFieldCollector throws AIOOBE if numHits is 0 > --- > > Key: LUCENE-2785 > URL: https://issues.apache.org/jira/browse/LUCENE-2785 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.1, 4.0 > > > See solr-user thread "ArrayIndexOutOfBoundsException for query with rows=0 > and sort param". > I think we should just create a null collector (only tallies up totalHits) if > numHits is 0? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
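The "Collector, that counts" fix for numHits == 0 is tiny. A sketch of its shape (assumed; Lucene's Collector API has more methods than shown here): when no hits are requested, skip the priority queue entirely and just tally totalHits, so no array of size 0 is ever indexed into.

```java
final class TotalHitCountCollectorSketch {
    private int totalHits;

    // No scoring, no priority queue - just count every matching doc.
    void collect(int doc) { totalHits++; }

    int getTotalHits() { return totalHits; }
}
```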
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965194#action_12965194 ] Earwin Burrfoot commented on LUCENE-2779: -
bq. Cloning the keySet() will be exactly the 'snapshotty' behavior we're looking for.
Before I made the change, you could call listAll(), lock RAMDir, return the array and before/after that files could be added/removed. W/ the clone, we'll get the same behavior - files can be added/removed before the clone, the clone would reflect those changes, whatever happens after the clone is invisible to the iterator - hence why I consider it snapshotty. There are still weird cases, when file B was added after deleting A, but you see both in listAll(). These remain, so it's not a "point in time", it's more like a "span in time". Whatever happened after toArray was invisible to the array too, so the behaviour hasn't changed.
bq. Also, what do you know, I've hit an AIOB exception thrown from listAll() when it called toArray() :)
But _this_ fact is really interesting. toArray() with no parameters failed on CHM? Cloning has a meaning now :)
bq. Earwin, I did not *just* backport it.
Didn't mean to offend anyone, sorry if I did.
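The "span in time, not point in time" behaviour described above is observable with plain JDK code: an array copied from a ConcurrentHashMap key set is fixed once taken, but the copy itself walks a weakly consistent iterator, so writes racing with the copy may or may not appear. A minimal sketch; the `snapshot` helper is an illustrative stand-in, not Lucene code:

```java
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotSketch {
    // Hypothetical listAll()-style helper: copy the current key set.
    static String[] snapshot(ConcurrentHashMap<String, Integer> files) {
        return files.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> files = new ConcurrentHashMap<>();
        files.put("A", 1);

        String[] names = snapshot(files);

        // Changes made after the copy are invisible to the array; changes
        // made by other threads *during* the copy may or may not show up,
        // because CHM iterators are only weakly consistent.
        files.remove("A");
        files.put("B", 2);

        System.out.println(names.length + " " + names[0]); // 1 A
        System.out.println(files.containsKey("A"));        // false
    }
}
```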
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964982#action_12964982 ] Earwin Burrfoot commented on LUCENE-2779: - Maybe we should commit it to 4.0 only? Doesn't look like a really important patch that just _has_ to be backported.
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964920#action_12964920 ] Earwin Burrfoot commented on LUCENE-2779: - I don't believe cloning the keySet() will make it more "snapshotty". There's no way to get a completely consistent snapshot of some concurrent data structure without locking it completely, or using a variant of the copy-on-write approach.
[jira] Commented: (LUCENE-2781) Drop deprecations from trunk
[ https://issues.apache.org/jira/browse/LUCENE-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964755#action_12964755 ] Earwin Burrfoot commented on LUCENE-2781: - Ok, restore it then and fix deprecation note to "will drop it in Lucene 5.0" or something? > Drop deprecations from trunk > > > Key: LUCENE-2781 > URL: https://issues.apache.org/jira/browse/LUCENE-2781 > Project: Lucene - Java > Issue Type: Task >Affects Versions: 4.0 >Reporter: Earwin Burrfoot >Assignee: Robert Muir >Priority: Minor > Attachments: drop-deprecations.patch, drop-deprecations.patch, > drop-deprecations.patch, drop-deprecations.patch > > > subj. > Also, to each remaining deprecation add release version when it first > appeared. > Patch incoming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2781) Drop deprecations from trunk
[ https://issues.apache.org/jira/browse/LUCENE-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964751#action_12964751 ] Earwin Burrfoot commented on LUCENE-2781: - Hmm.. and regarding this exact case. DutchStemFilter.java had this:
bq. @deprecated Use {@link SnowballFilter} with {@link org.tartarus.snowball.ext.DutchStemmer} instead, which has the same functionality.
Considering the following snippet from DutchStemmer.java:
bq. The algorithm is an implementation of the dutch stemming algorithm (http://snowball.tartarus.org/algorithms/dutch/stemmer.html) in Martin Porter's snowball project.
we can reliably drop DutchStemFilter+DutchStemmer, people will fix their code to use SnowballFilter and happily read their 3.0 indexes. No?