[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745 ]

Michael Busch commented on LUCENE-2312:
---------------------------------------

The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases.

It's getting late now - need to sleep!

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 3.0.1
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max
> doc ids.
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745 ]

Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:51 AM:

The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs (i.e. does volatile writes) in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases.

It's getting late now - need to sleep!
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845735#action_12845735 ]

Jason Rutherglen commented on LUCENE-2312:

{quote}This makes the reference to the array volatile, not the slots in the array{quote}

That's no good! :)

{quote}If you use a RW lock then the writer thread will block all reader threads while it's making changes{quote}

We probably need to implement more fine-grained locking, perhaps using volatile booleans instead of RW locks. Fine-grained meaning on the byte array/block level. I think this would imply that changes are not visible until a given byte block is more or less "flushed"? This is different from the design that's been implied, that we'd read from byte arrays as they're being written to. We probably don't need to read from and write to the same byte array concurrently (that might not be feasible?).

The performance win here is probably going to be the fact that we avoid segment merges.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731 ]

Michael Busch commented on LUCENE-2312:

{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?
[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731 ]

Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:12 AM:

{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?

Do you mean this?

{code}
volatile byte[] array;
{code}

This makes the *reference* to the array volatile, not the slots in the array.
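To make the reference-versus-slots distinction concrete, here is a minimal Java sketch. It is not code from the thread or from Lucene; the class and method names are made up for illustration. It contrasts a volatile array reference, whose element stores remain plain writes, with java.util.concurrent.atomic.AtomicIntegerArray, which gives volatile semantics per element (but only for int slots).

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical demo class; not part of Lucene.
class VolatileReferenceDemo {
    // 'volatile' applies to the reference only, not to array[0], array[1], ...
    static volatile byte[] array = new byte[4];

    // Every get/set on an element of this array has volatile semantics.
    static final AtomicIntegerArray atomicSlots = new AtomicIntegerArray(4);

    static void writeSlot(int index, byte value) {
        // Volatile read of the reference, followed by a plain element store.
        // Nothing here guarantees that another thread will see the new value.
        array[index] = value;
    }

    static void replaceArray(byte[] fresh) {
        // Volatile write of the reference: a thread that later reads 'array'
        // and sees 'fresh' also sees everything written into 'fresh' before
        // this assignment.
        array = fresh;
    }

    public static void main(String[] args) {
        writeSlot(1, (byte) 7);   // plain element write
        atomicSlots.set(1, 7);    // volatile element write
        replaceArray(new byte[] {1, 2, 3, 4});
        System.out.println(array[0] + " " + atomicSlots.get(1));
    }
}
```

The sketch only shows which operations carry volatile semantics; it makes no claim about which approach would perform acceptably for the RAM buffer.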
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845729#action_12845729 ]

Jason Rutherglen commented on LUCENE-2312:

{quote}but my goal here is to implement a non-blocking and lock-free algorithm. So my idea was to make use of a very subtle behavior of volatile variables.{quote}

You're talking about having a per-thread write buffer byte array that on search gets copied into a read-only array, or gets transformed magically into a volatile byte array? (Do volatile byte arrays work? I couldn't find a clear answer on the net, maybe it's stated in the Goetz book.)

If volatile byte arrays do work, an option to test would be a byte buffer pool that uses volatile byte arrays?
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845726#action_12845726 ]

Michael Busch commented on LUCENE-2312:

{quote}
A quick and easy way to solve this is to use a read write lock on the byte pool?
{quote}

If you use a RW lock then the writer thread will block all reader threads while it's making changes. The writer thread will be making changes all the time in a real-time search environment. The contention will kill performance, I'm sure. A RW lock is only faster than a mutual exclusion lock if writes are infrequent, as mentioned in the javadocs of ReadWriteLock.java.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845721#action_12845721 ]

Jason Rutherglen commented on LUCENE-2312:

Just to clarify, I think Mike's referring to ParallelArray?
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.html

There's AtomicIntegerArray:
http://www.melclub.net/java/_atomic_integer_array_8java_source.html
which underneath uses the sun.misc.Unsafe class for volatile array access. Could this be reused for an AtomicByteArray class (why isn't there one of these already?).

A quick and easy way to solve this is to use a read write lock on the byte pool? Remember when we'd sync on each read bytes call to the underlying random access file in FSDirectory (e.g., now we're using NIOFSDir, which can be a good concurrent throughput improvement). Let's try the RW lock and examine the results? I guess the issue is we're not writing in blocks of bytes, we're actually writing byte by byte and need to read byte by byte concurrently? This sounds like a fairly typical thing to do?
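On the AtomicByteArray question: java.util.concurrent.atomic has no such class, but one possible way to get per-byte volatile semantics is to pack four bytes into each int of an AtomicIntegerArray and update a single byte with a compare-and-set loop. The following is only a sketch under that assumption; the class name is hypothetical, nobody in the thread proposed this exact layout, and whether it would be fast enough for the byte pool is untested.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical sketch; four bytes are packed into each int slot.
final class AtomicByteArraySketch {
    private final AtomicIntegerArray words;
    private final int length;

    AtomicByteArraySketch(int length) {
        this.length = length;
        this.words = new AtomicIntegerArray((length + 3) >>> 2);
    }

    int length() {
        return length;
    }

    // Volatile read of the int that holds the byte, then extract the byte.
    byte get(int index) {
        checkIndex(index);
        int shift = (index & 3) << 3;
        return (byte) (words.get(index >>> 2) >>> shift);
    }

    // CAS loop so that concurrent writers to neighbouring bytes in the same
    // int do not overwrite each other's updates.
    void set(int index, byte value) {
        checkIndex(index);
        int word = index >>> 2;
        int shift = (index & 3) << 3;
        int mask = 0xFF << shift;
        int update = (value & 0xFF) << shift;
        int old;
        do {
            old = words.get(word);
        } while (!words.compareAndSet(word, old, (old & ~mask) | update));
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public static void main(String[] args) {
        AtomicByteArraySketch bytes = new AtomicByteArraySketch(10);
        bytes.set(5, (byte) -3);
        bytes.set(4, (byte) 127);
        System.out.println(bytes.get(4) + " " + bytes.get(5));
    }
}
```

Note that every single-byte write here costs at least one volatile read plus a CAS, which is the kind of per-write overhead the lock-free discussion in this thread is trying to avoid.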
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845712#action_12845712 ]

Michael Busch commented on LUCENE-2312:

{quote}
Hmm... what does JMM say about byte arrays? If one thread is writing to the byte array, can any other thread see those changes?
{quote}

This is exactly the right question to ask here. Thread-safety is by far the most complicated aspect of this feature. Jason, I'm not sure if you already figured out how to ensure visibility of changes made by the writer thread to the reader threads?

Thread-safety in our case boils down to safe publication. We don't need locking to coordinate writing of multiple threads, because of LUCENE-2324. But we need to make sure that the reader threads see all changes they need to see at the right time, in the right order. This is IMO very hard, but we all like challenges :)

The JMM gives no guarantee whatsoever about which changes made by one thread another thread will see - or whether it will ever see them - unless proper publication is ensured by either synchronization or volatile/atomic variables.

So e.g. if a writer thread executes the following statements:
{code}
public static int a, b;

...

a = 1; b = 2;

a = 5; b = 6;
{code}

and a reader thread does:
{code}
System.out.println(a + "," + b);
{code}

The thing to remember is that the output might be: 1,6! Another reader thread with the following code:
{code}
while (b != 6) {
  .. do something
}
{code}
might further NEVER terminate without synchronization/volatile/atomic.

The reason is that the JVM is allowed to perform any reorderings to utilize modern CPUs, memory, caches, etc. if not forced otherwise.

To ensure safe publication of data written by a thread we could do synchronization, but my goal here is to implement a non-blocking and lock-free algorithm. So my idea was to make use of a very subtle behavior of volatile variables. I will take a simple explanation of the JMM from Brian Goetz' awesome book "Java Concurrency in Practice", in which he describes the JMM in simple happens-before rules. I will mention only three of those rules, because they are enough to describe the volatile behavior I'd like to mention here (p. 341):

*Program order rule:* Each action in a thread _happens-before_ every action in that thread that comes later in the program order.

*Volatile variable rule:* A write to a volatile field _happens-before_ every subsequent read of that same field.

*Transitivity:* If A _happens-before_ B, and B _happens-before_ C, then A _happens-before_ C.

Based on these three rules you can see that writing to a volatile variable v by one thread t1 and subsequent reading of the same volatile variable v by another thread t2 publishes ALL changes of t1 that happened-before the write to v and the change of v itself. So this write/read of v means crossing a memory barrier and forcing everything that t1 might have written to caches to be flushed to the RAM. That's why a volatile write can actually be pretty expensive.

Note that this behavior has only worked as I just described since Java 1.5. The behavior of volatile variables was a very, very subtle change from 1.4->1.5!

The way I'm trying to make use of this behavior is actually similar to how we lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM write-through as much as possible, which increases the probability of getting the sync for free! Still fleshing out the details, but I wanted to share this info with you guys already, because it might invalidate a lot of assumptions you might have when developing the code.

Some of this stuff was actually new to me, maybe you all know it already. And if anything that I wrote here is incorrect, please let me know!

Btw: IMO, if there's only one Java book you can ever read, then read Goetz' book! It's great. He also says in the book somewhere about lock-free algorithms: "Don't try this at home!" - so, let's do it! :)
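As an illustration of how the three happens-before rules combine, here is a minimal Java sketch of publishing plain array writes through a single volatile counter that is only written at intervals. It is not the prototype discussed in this thread; the class name and the interval of 100 are assumptions made for the example, and it presumes a single writer thread that never overwrites a slot once it has been published.

```java
// Hypothetical sketch; single writer, many readers, append-only slots.
class PublishedIntBuffer {
    private final int[] slots;       // plain, non-volatile storage
    private int writePos;            // touched by the writer thread only
    private volatile int published;  // number of slots readers may look at

    PublishedIntBuffer(int capacity) {
        slots = new int[capacity];
    }

    // Writer only: plain write, not yet visible to readers.
    void append(int value) {
        slots[writePos++] = value;
    }

    // Writer only: the volatile write publishes every slot written before it.
    void publish() {
        published = writePos;
    }

    // Reader: volatile read; all slots below the returned count are visible.
    int visibleCount() {
        return published;
    }

    // Reader: refuses to touch slots that have not been published yet.
    int get(int index) {
        int limit = published;
        if (index < 0 || index >= limit) {
            throw new IndexOutOfBoundsException("slot " + index + " is not published");
        }
        return slots[index];
    }

    public static void main(String[] args) throws InterruptedException {
        final int total = 1000;
        final PublishedIntBuffer buffer = new PublishedIntBuffer(total);
        Thread writer = new Thread(() -> {
            for (int i = 0; i < total; i++) {
                buffer.append(i * 2);
                if ((i + 1) % 100 == 0) {
                    buffer.publish(); // volatile write only every 100 appends
                }
            }
        });
        writer.start();
        int seen = 0;
        while (seen < total) { // busy-wait, acceptable for a demo only
            int limit = buffer.visibleCount();
            for (; seen < limit; seen++) {
                if (buffer.get(seen) != seen * 2) {
                    throw new AssertionError("stale slot " + seen);
                }
            }
        }
        writer.join();
        System.out.println("reader saw all " + seen + " published slots");
    }
}
```

The reader never follows an index at or beyond the published count, which mirrors the requirement that a reader must not follow pointers to non-published locations. Readers lag the writer by up to one publish interval; that is the trade-off for doing fewer volatile writes.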
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845703#action_12845703 ]

Michael Busch commented on LUCENE-2312:

{quote}
Sounds like awesome progress!! Want some details over here :)
{quote}

Sorry for not being very specific. The prototype I'm experimenting with has a fixed length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1 remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! :) If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Again, the same term doesn't occur often in super short docs.

The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the posting lists in opposite order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice.
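The fixed-length posting format described above can be sketched in a few lines of Java. This is an illustration only, not the prototype's code: the class and method names are hypothetical, and placing the docID in the high 24 bits and the position in the low 8 bits is an assumption (the comment only says "first 3 bytes" and "1 remaining byte"). The last method shows the described freq handling for the newest document when walking a posting array backwards.

```java
// Hypothetical sketch of a 4-byte posting: 24-bit docID plus 8-bit position.
final class PackedPosting {
    static final int MAX_DOCS = 1 << 24; // 2^24 docs per in-memory segment
    static final int MAX_POSITION = 255; // one byte for the position

    static int encode(int docId, int position) {
        if (docId < 0 || docId >= MAX_DOCS) {
            throw new IllegalArgumentException("docID out of range: " + docId);
        }
        if (position < 0 || position > MAX_POSITION) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return (docId << 8) | position;
    }

    static int docId(int posting) {
        return posting >>> 8; // unsigned shift keeps large docIDs positive
    }

    static int position(int posting) {
        return posting & 0xFF;
    }

    // Walks backwards from the newest posting and counts how many consecutive
    // postings share its docID, i.e. the term frequency in the newest doc.
    static int freqOfNewestDoc(int[] postings, int count) {
        if (count == 0) {
            return 0;
        }
        int doc = docId(postings[count - 1]);
        int freq = 0;
        for (int i = count - 1; i >= 0 && docId(postings[i]) == doc; i--) {
            freq++;
        }
        return freq;
    }

    public static void main(String[] args) {
        int[] postings = { encode(1, 0), encode(2, 3), encode(2, 9) };
        System.out.println("newest doc = " + docId(postings[2])
            + ", freq = " + freqOfNewestDoc(postings, postings.length));
    }
}
```

Because docIDs are absolute rather than delta-encoded, any posting can be decoded on its own, which is what makes backwards traversal and binary search within a slice straightforward.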
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845702#action_12845702 ]

Shai Erera commented on LUCENE-2310:

I like the idea of Document implementing Iterable, but how does that solve the case where someone wants to query how many fields a document has? Will you still have getFields(), only now it will return an unmodifiable collection? I guess the unmod collection can be returned even today, right?

BTW, what happens if getFields() returns an unmod collection, but someone calls doc.add(Field)? I think the unmod collection prevents you from adding to that collection wrapper, but does not prevent the underlying collection from being changed under the hood? If that's true, then that could cause some trouble ... so getFields() will really return a snapshot of Document, which means we need to clone Fields ... Gets too complicated, no?

Maybe just do: (1) Doc implements Iterable and (2) Doc exposes numFields(), add(Field)?

About remove(field), I thought of a possible scenario, though I still don't think it's interesting enough - suppose that you pass your Document through a processing pipeline/chain, where each handler adds fields as metadata to the Document. For example, annotators. It might be that a field A exists, only for a handler down the chain to understand A's meaning and then replace it w/ A1 and A2. For that you'll want to be able to remove a field ... I guess we could add a remove method to Document, and if it is called while the fields are being iterated on, a CME will be thrown, which is perfectly fine with me.

> Reduce Fieldable, AbstractField and Field complexity
> ----------------------------------------------------
>
>                 Key: LUCENE-2310
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2310
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: Index
>            Reporter: Chris Male
>         Attachments: LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.
> Currently AbstractField depends on Field, and does not provide much more
> functionality than storing fields, most of which are being moved over to
> FieldType. Therefore it seems ideal to try to deprecate AbstractField (and
> possibly Fieldable), moving much of the functionality into Field and
> FieldType.
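The behavior asked about above can be checked with plain java.util collections. The sketch below is not Lucene code: strings stand in for Field objects and the class and method names are hypothetical. It shows that Collections.unmodifiableList returns a read-only view that still reflects later changes to the backing list, that a real snapshot requires a copy, and that modifying an ArrayList while iterating over its view throws a ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for a Document's field list; strings replace Fields.
class FieldViewDemo {
    // Read-only wrapper: rejects mutation through itself, tracks 'backing'.
    static List<String> readOnlyView(List<String> backing) {
        return Collections.unmodifiableList(backing);
    }

    // A true snapshot needs a copy of the backing list.
    static List<String> snapshot(List<String> backing) {
        return Collections.unmodifiableList(new ArrayList<>(backing));
    }

    // Returns true if adding to a non-empty backing ArrayList while iterating
    // over its view throws a ConcurrentModificationException.
    static boolean addDuringIterationFails(List<String> backing) {
        boolean added = false;
        try {
            for (String name : readOnlyView(backing)) {
                if (!added) {
                    backing.add("added-while-iterating");
                    added = true;
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> fields = new ArrayList<>(Arrays.asList("title", "body"));
        List<String> view = readOnlyView(fields);
        List<String> copy = snapshot(fields);

        fields.add("author"); // plays the role of doc.add(Field)
        System.out.println("view size = " + view.size()
            + ", snapshot size = " + copy.size());
        System.out.println("CME while iterating: " + addDuringIterationFails(fields));
    }
}
```

So the view is cheap but not a snapshot, and the copy is a snapshot but costs an allocation per call, which is the trade-off behind the simpler Iterable plus numFields() suggestion.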
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845696#action_12845696 ]

Jason Rutherglen commented on LUCENE-2312:

Payloads work (non-lazy loading); however, ByteSliceReader doesn't implement a seek method, so I think we simply need to load each payload as we increment nextPosition? The cost shouldn't be too high because we're simply copying small byte arrays (in the heap).
Re: lucene and solr trunk
: prime-time as the new solr trunk! Lucene and Solr need to move to a
: common trunk for a host of reasons, including single patches that can
: cover both, shared tags and branches, and shared test code w/o a test
: jar.

Without a clearer picture of how people envision development "overhead" working as we move forward, it's really hard to understand how any of these ideas make sense...

1) how should the automated build process(es) work?
2) how are we going to do branching/tagging for releases? particularly in situations where one product is ready for a release and the other isn't?
3) how are we going to deal with minor bug fix release tagging?
4) should it be possible for people to check out Lucene-Java w/o checking out Solr? (i suspect a whole lot of people who only care about the core library are going to really adamantly not want to have to check out all of Solr just to work on the core)

: Both projects move to a new trunk:
: /something/trunk/java, /something/trunk/solr

my gut says something like this will make the most sense, assuming "/something/trunk" == "/java/trunk" and "java" actually means "core" ... ie: this discussion should really be part and parcel with how contribs should be reorged.

-Hoss
Re: lucene and solr trunk
On Mon, Mar 15, 2010 at 11:41 PM, Mark Miller wrote: >> >> Solr moves to Lucene's trunk: >> /java/trunk, /java/trunk/sol > > +1. With the goal of merged dev, merged tests, this looks the best to me. > Simple to do patches that span both, simple to setup > Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. > +1 -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On 03/15/2010 11:28 PM, Yonik Seeley wrote: So, we have a few options on where to put Solr's new trunk: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. -- - Mark http://www.lucidimagination.com
lucene and solr trunk
Due to a tremendous amount of work by our newly merged committer corps, the get-on-lucene-trunk branch (branches/solr) is ready for prime-time as the new solr trunk! Lucene and Solr need to move to a common trunk for a host of reasons, including single patches that can cover both, shared tags and branches, and shared test code w/o a test jar. The current Lucene trunk is: .../lucene/java/trunk The current Solr trunk is: .../lucene/solr/trunk So, we have a few options on where to put Solr's new trunk: Lucene moves to Solr's trunk: /solr/trunk, /solr/trunk/lucene Solr moves to Lucene's trunk: /java/trunk, /java/trunk/solr Both projects move to a new trunk: /something/trunk/java, /something/trunk/solr -Yonik
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845680#action_12845680 ] Jason Rutherglen commented on LUCENE-2312: -- In thinking about the terms dictionary: we're going to run into concurrency issues if we just use TreeMap, right? Can't we simply use the lock-free ConcurrentSkipListMap? Yeah, it's a part of Java 6, however why reinvent the wheel?
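A minimal sketch of what the ConcurrentSkipListMap suggestion could look like, assuming a single indexing thread that adds terms while search threads walk them in sorted order. The class `SketchTermsDictionary`, its method names, and the integer "posting id" are hypothetical and are not part of IndexWriter; only `ConcurrentSkipListMap` and its `ConcurrentNavigableMap` methods come from the JDK. Note that `String` keys sort by UTF-16 code unit, which is an assumption of this sketch rather than a statement about Lucene's term order.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical in-RAM terms dictionary that stays sorted while it is being written. */
final class SketchTermsDictionary {
    // term text -> hypothetical id of the term's in-RAM posting list
    private final ConcurrentNavigableMap<String, Integer> terms =
        new ConcurrentSkipListMap<String, Integer>();

    // Touched only by the single indexing thread assumed in this sketch.
    private int nextId = 0;

    /** Indexing thread: returns the existing id for the term, or assigns a new one. */
    int addTerm(String term) {
        Integer existing = terms.get(term);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        terms.put(term, id);
        return id;
    }

    /** Search threads: first term greater than or equal to target, or null if none. */
    String seekCeiling(String target) {
        return terms.ceilingKey(target);
    }

    /** Search threads: sorted view of all terms at or after start. */
    Iterable<String> termsFrom(String start) {
        // The view's iterators are weakly consistent: they never throw
        // ConcurrentModificationException, but may or may not reflect
        // terms added after iteration began.
        return terms.tailMap(start, true).keySet();
    }
}
```

Because the map is always sorted, no separate sort step is needed before a query runs. A reader that needs a fixed snapshot would still have to bound what it looks at some other way, since iteration here is only weakly consistent.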
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845679#action_12845679 ] Jason Rutherglen commented on LUCENE-2312: -- Basic term positions working, need to figure out how to do lazy loading payloads...
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845663#action_12845663 ] Jason Rutherglen commented on LUCENE-2312: -- I have a test case showing the term docs working... I'm going to try to add the term positions methods.
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
> I mean specifically one should not have to commit to the precise
> scoring model they will use for a given field, when they index that
> field.

Yeah, I've never seen committing to a precise scoring model at index-time via Sim choice as a big deal. In Lucy, per-field Similarity assignments are part of the Schema, which has to be set at index-time. And index-time Sim choice is the way things have always been done in Lucene.

In any case, the proposal to start delaying Sim choice to search-time -- while a nice feature for Lucene -- is a non-starter for Lucy. We can't do that because it would kill the cheap-Searcher model to generate boost bytes at Searcher construction time and cache them within the object. We need those boost bytes written to disk so we can mmap them and share them amongst many cheap Searchers.

So... you're proposing shrinking Similarity's public API by removing functionality that Lucy can't live without. If indeed that works out for Lucene, the role of Similarity within the two libraries will have to diverge. In Lucene, Similarity will get smaller; in Lucy it will expand a bit.

To my mind, these are all related data reduction tasks:

* Omit doc-boost and field-boost, replacing them with a single float docXfield multiplier -- because you never need doc-boost on its own.
* Omit length-in-tokens, term-cardinality, doc-boost, and field-boost, replacing them all with a single boost byte -- because for the kind of scoring you want to do, you don't need all those raw stats.
* Omit the boost byte, because you don't need to do scoring at all.
* Omit positions because you don't need PhraseQueries, etc. to match.
* Omit everything except doc-id, because you only need binary matching.

What all those tasks have in common is that we can determine what stats are disposable based on how the user describes how they are going to use the field.
For Lucy, the user is going to have to commit to a "precise scoring model" at index-time by specifying a Sim choice anyway. If that Sim turns out to be a MatchSimilarity, why on earth should we keep around the boost bytes?

> > And what class other than Similarity knows enough about the scoring algorithm
> > to perform these data reduction tasks? If it's not going to be Similarity
> > itself, it has to be something that knows absolutely everything about the
> > Similarity implementation's scoring model.
>
> I don't follow this...
>
> It will be Sim that computes norm bytes.

I meant that if you're writing out boost bytes, there's no sensible way to execute the lossy data reduction and reduce the index size other than having Sim do it.

> > class MySim extends Similarity {
> >   public PostingCodec makePostingCodec() {
> >     StandardPostingCodec codec = new StandardPostingCodec();
> >     codec.setOmitBoostBytes(true);
> >     codec.setOmitPositions(true);
> >     return (PostingCodec)codec;
> >   }
> > }
>
> This still feels like you are mixing two very different concepts --
> what's being written (boost bytes, positions, docTermFreqs) vs how it's
> encoded (codec).

So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()? Maybe that's right. Guess I'll watch to see how flex pans out and what methods you put on those PostingCodec classes.

For now, I just want to make the no-boost-bytes and doc-id-only index optimizations available, and to achieve that, it's sufficient to implement format-follows-sim and publish MatchSimilarity and MinimalSimilarity. The PostingCodec API can remain a private implementation detail until a later date.

> Shouldn't Lucy's schema record what stats should be indexed for the field?

No, it shouldn't -- not directly. You tell the Schema how you want the field to be used. That information is used to derive what stats are needed, and whether the ones that are needed can be combined, compressed, etc.
> Then, any codec you swap in should respect that? EG maybe I use PForCodec
> instead, or a PulsingCode(PForCodec)?

I guess. I don't see publishing a PForCodec with an elaborate API as being very important, though. It's more important to just use PFOR internally when it's the best choice.

> I'm thinking the various Sim classes, which you'd select during
> searching, will note in jdocs what attrs must be indexed. It's your
> job to read that and set your field (schema) up accordingly, ie,
> enable those required attrs.

Yeah, that'll at least get the job done for Lucene. I don't think it's ideal to force people to understand that stuff, but hey, the more people are confused, the more important it is for them to buy optimization seminars where Lucene gurus explain all the obscure incantations to them. :)

> > You seem to be fixated on the notion of swapping in a MatchOnlySim object at
> > search time. You can't do that in KS/Lucy, because you can't modify a Schema
> > at
[jira] Updated: (LUCENE-2098) make BaseCharFilter more efficient in performance
[ https://issues.apache.org/jira/browse/LUCENE-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2098: Attachment: LUCENE-2098.patch i haven't benchmarked to see if this is any faster; it may even be worse. but it's no longer a linear algorithm > make BaseCharFilter more efficient in performance > - > > Key: LUCENE-2098 > URL: https://issues.apache.org/jira/browse/LUCENE-2098 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.9 >Reporter: Koji Sekiguchi >Priority: Minor > Attachments: LUCENE-2098.patch > > > Performance degradation in Solr 1.4 was reported. See: > http://www.lucidimagination.com/search/document/43c4bdaf5c9ec98d/html_stripping_slower_in_solr_1_4 > The inefficiency has been pointed out in BaseCharFilter javadoc by Mike: > {panel} > NOTE: This class is not particularly efficient. For example, a new class > instance is created for every call to addOffCorrectMap(int, int), which is > then appended to a private list. > {panel}
Re: [DISCUSS] Do away with Contrib Committers and make core committers
> > Personally I'd prefer we just stop adding them, and the current ones work > their way up like normal if they are so inclined, or the ones that are not > even around anymore can just stay as they are. > This seems reasonable to me.
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845530#action_12845530 ] Earwin Burrfoot commented on LUCENE-2320: - We could split MergePolicy in two - class that represents the policy (config/factory) and class that acts on that policy (instance). So IW gets a MergePolicy that has no outside references, and creates a MergePoliceman from it, supplying 'this' on construction. Thus, circular reference still exists, but is contained for good. Not sure I totally love the idea myself though. > Add MergePolicy to IndexWriterConfig > > > Key: LUCENE-2320 > URL: https://issues.apache.org/jira/browse/LUCENE-2320 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2320.patch > > > Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as > well. The change is not straightforward and so I've kept it for a separate > issue. MergePolicy requires in its ctor an IndexWriter, however none can be > passed to it before an IndexWriter actually exists. And today IW may create > an MP just for it to be overridden by the application one line afterwards. I > don't want to make iw member of MP non-final, or settable by extending > classes, however it needs to remain protected so they can access it directly. > So the proposed changes are: > * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set > once (hence its name). It'll have the signature SetOnce w/ *synchronized > set* and *T get()*. T will be declared volatile, so that get() won't be > synchronized. > * MP will define a *protected final SetOnce writer* instead of > the current writer. *NOTE: this is a bw break*. any suggestions are welcomed. > * MP will offer a public default ctor, together with a set(IndexWriter). > * IndexWriter will set itself on MP using set(this). 
Note that if set will be > called more than once, it will throw an exception (AlreadySetException - or > does someone have a better suggestion, preferably an already existing Java > exception?). > That's the core idea. I'd like to post a patch soon, so I'd appreciate your > review and proposals.
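The SetOnce holder described in the proposal can be sketched as follows. This is only an illustration of the description in the issue (synchronized set, volatile value, unsynchronized get, exception on a second set), not the attached patch. `AlreadySetException` is the placeholder name from the issue text; making it a subclass of `IllegalStateException` is an assumption of this sketch.

```java
/** Thrown on a second call to set(); the name is the placeholder used in the proposal. */
final class AlreadySetException extends IllegalStateException {
    AlreadySetException() {
        super("The object cannot be set twice!");
    }
}

/** Holder whose value can be assigned exactly once. */
final class SetOnce<T> {
    // volatile, so get() can stay unsynchronized and still see the value
    // once set() has stored it.
    private volatile T value;

    // Guarded by the monitor of this object; tracks whether set() already ran,
    // so that even a null value counts as "set".
    private boolean alreadySet;

    /** Stores the value on the first call; every later call throws. */
    public synchronized void set(T newValue) {
        if (alreadySet) {
            throw new AlreadySetException();
        }
        value = newValue;
        alreadySet = true;
    }

    /** Returns the stored value, or null if set() has not been called yet. */
    public T get() {
        return value;
    }
}
```

Under this shape, a MergePolicy with a public default constructor could hold a `protected final SetOnce<IndexWriter> writer` field, and IndexWriter would call `set(this)` on it once during its own construction.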
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845503#action_12845503 ] Jason Rutherglen commented on LUCENE-2312: -- Also wanted to add that the PostingList lastDocID is correct.
Re: [DISCUSS] Do away with Contrib Committers and make core committers
My 2 cents as one who has no aspirations of ever being a committer. I think with the pending re-org of contrib and the value of contrib, it doesn't make much sense to have the distinction between core and contrib let alone for contributors. Regarding the former low bar, either prune the list (voluntarily or forcefully), prune individuals when they commit something they really, really shouldn't have (e.g. no discussion, no consensus), or give several opportunities to do right then prune. But in any case, spell out the expectations and document it (perhaps in the wiki). I think it can work and there will be little if any problem with it. -- DM On 03/15/2010 02:33 PM, Grant Ingersoll wrote: On Mar 15, 2010, at 1:25 PM, Mark Miller wrote: On 03/15/2010 08:33 AM, Grant Ingersoll wrote: Right, Mark. I think we would be effectively raising the bar to some extent for what it takes to be a committer. That's part of my point though - some are contrib committers with a lower bar - now they are core/solr committers with that lower bar, but someone else that came along would not get to the same position now? I think they may just have a little more work to do, either that or maybe we just have a little more faith that the right things will be done. We'd also be making contrib a first class citizen (not that it ever wasn't, but some people have that perception). I think because it was kind of true. I could come along before and donate contrib x, and never show I worked well with the community or build up the merit needed to be a committer, and be made a contrib committer simply to maintain my module. That's happened plenty. True. I guess what I'm saying is we can still make them committers and it may be that they still only will work on "their" module, but we should base our vote on them being "full" committers. I don't like the notion of modules belonging to someone (not that you were implying that, I know.) I guess I just see it as you either have earned merit or not. 
That's how we do it in Solr and Mahout and they both have modules/contribs and it also fits more with the notion of "one project, one set of committers". Finally, I think we need to recognize that not everyone needs to be a McCandless in order to contribute in a helpful way. We obviously recognize that or else I wouldn't be here! I think it's more about fitting in - showing you get and follow the Apache way. Showing that ideas and changes you might push are in line with what the other committers think is appropriate of a core/solr committer. Talent is not key here - community is. The bar for this has been *much* higher for core than contrib in the past. And contrib has had different bars over time - I think it was even lower in the past at points. Agreed. I think sometimes we forget that you can do svn revert. I hate to have to do that. I don't think it's a great way to handle this - we could make everyone a committer at a drop of a hat and say we can just revert. I wouldn't call for a revert except in exceptional circumstances. I don't think that's the point. Right, obviously I wasn't implying we'd want to do it, but we can if it is absolutely necessary.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845493#action_12845493 ] Jason Rutherglen commented on LUCENE-2312: -- {quote}Ahh, I think it's because you're not calling compactPostings/sortPostings in the THPF, right? Those methods collapse the hash table in-place (ie move all the nulls out), and sort.{quote} Yep, got that part. {quote}So you have to re-work the code to not do that and instead use whatever structure you have for visiting terms in sorted order. Then stepping through the docs should just work, but, you gotta stop at the max docID, right?{quote} Right, the terms in sorted order is working... The freq ByteSliceReader is reading nothing however (zeroes). Either it's init'ed to the wrong position, or there's nothing in there? Or something else.
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845490#action_12845490 ] Shai Erera commented on LUCENE-2320: The thing is that we were at that position already, before I changed it so that MP requires writer up front. The reason was, like Mike mentioned, that writer had to be passed on all method calls, for really no good reason. A MP is usually coupled w/ an IW instance and I don't think we should opt for decoupling them. Most of this patch moves the MP setting from IW to IWC (and hence changes test code to use the new API). The SetOnce juggling is done only to ensure an IW is set exactly once on MP, and allows us to resolve that circular dependency. We can do two things: # Continue w/ SetOnce as introduced in this patch. # Introduce a setIndexWriter on MP which anyone can call, even more than once. With (1) I don't think we complicate anything, and SetOnce can be useful in other places as well. (2) is really like passing writer on all method calls, so let's at least not have it as part of all methods' signatures. I prefer (1) slightly over (2) but am fine w/ (2) as well. I wouldn't want to change MP back to require IW on all its methods.
[jira] Resolved: (LUCENE-2325) investigate solr test failures using flex
[ https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2325. Resolution: Fixed Solr can now run on flex :) > investigate solr test failures using flex > - > > Key: LUCENE-2325 > URL: https://issues.apache.org/jira/browse/LUCENE-2325 > Project: Lucene - Java > Issue Type: Test >Affects Versions: Flex Branch >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: Flex Branch > > Attachments: LUCENE-2325.patch, LUCENE-2325.patch > > > We have a branch of Solr located here: > https://svn.apache.org/repos/asf/lucene/solr/branches/solr > Currently all the tests pass with lucene trunk jars. > I plopped in the flex jars and they do not, so I thought these might be > interesting to look at.
[jira] Updated: (LUCENE-2325) investigate solr test failures using flex
[ https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2325: --- Attachment: LUCENE-2325.patch The bug was... if you asked for TermsEnum on a non-existent field on a foreign IndexReader (like Solr's, SolrIndexReader), so that the "emulate flex API on top of non-flex API" layer is used, then the returned TermsEnum would incorrectly return 1 term, and then null, when it should've returned null right off. I'll commit shortly -- simple fix. With this all Solr's tests pass when you drop in the flex JARs!! Yay.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845464#action_12845464 ]

Michael McCandless commented on LUCENE-2312:

Ahh, I think it's because you're not calling compactPostings/sortPostings in the THPF (TermsHashPerField), right? Those methods collapse the hash table in-place (ie move all the nulls out), and sort. So you have to re-work the code to not do that and instead use whatever structure you have for visiting terms in sorted order. Then stepping through the docs should just work, but, you gotta stop at the max docID, right?

Hmm... what does JMM say about byte arrays? If one thread is writing to the byte array, can any other thread see those changes?
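The "collapse in-place, then sort" step described above can be sketched in isolation. This is only an illustration of the technique, not the TermsHashPerField code: `CompactAndSortSketch` is a hypothetical name and plain strings stand in for the posting objects held in the hash slots.

```java
import java.util.Arrays;

final class CompactAndSortSketch {
    /** Moves all non-null slots to the front of the array, in place; returns how many there are. */
    static int compact(String[] slots) {
        int upto = 0;
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] != null) {
                if (upto < i) {
                    slots[upto] = slots[i];
                    slots[i] = null;
                }
                upto++;
            }
        }
        return upto;
    }

    /** Compacts, then sorts the live prefix. Afterwards the array is no longer a valid hash table. */
    static int compactAndSort(String[] slots) {
        int count = compact(slots);
        Arrays.sort(slots, 0, count);
        return count;
    }
}
```

The last comment is the point of the exchange: once the slots have been moved, hash lookups against the same array no longer work, which is why a reader that runs concurrently with indexing needs some other structure for visiting terms in sorted order.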
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845448#action_12845448 ]

Jason Rutherglen commented on LUCENE-2312:

The code is from FreqProxFieldMergeState, which accepts a FreqProxTermsWriterPerField in its constructor. One difference is that instead of operating on an array of posting lists, the code above assumes one posting list. numPostings was always 0 when testing with

{code}this.numPostings = field.termsHashPerField.numPostings;{code}

In the code above it's hard-coded to 1. Maybe there's some initialization that's not happening correctly?
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On Mar 15, 2010, at 1:25 PM, Mark Miller wrote:

> On 03/15/2010 08:33 AM, Grant Ingersoll wrote:
>> Right, Mark. I think we would be effectively raising the bar to some extent for what it takes to be a committer.
>
> That's part of my point though - some are contrib committers with a lower bar - now they are core/solr committers with that lower bar, but someone else that came along would not get to the same position now?

I think they may just have a little more work to do, either that or maybe we just have a little more faith that the right things will be done.

>> We'd also be making contrib a first class citizen (not that it ever wasn't, but some people have that perception).
>
> I think because it was kind of true. I could come along before and donate contrib x, and never show I worked well with the community or build up the merit needed to be a committer, and be made a contrib committer simply to maintain my module. That's happened plenty.

True. I guess what I'm saying is we can still make them committers and it may be that they still only will work on "their" module, but we should base our vote on them being "full" committers. I don't like the notion of modules belonging to someone (not that you were implying that, I know.) I guess I just see it as you either have earned merit or not. That's how we do it in Solr and Mahout and they both have modules/contribs and it also fits more with the notion of "one project, one set of committers".

>> Finally, I think we need to recognize that not everyone needs to be a McCandless in order to contribute in a helpful way.
>
> We obviously recognize that or else I wouldn't be here! I think it's more about fitting in - showing you get and follow the Apache way. Showing that ideas and changes you might push are in line with what the other committers think is appropriate of a core/solr committer. Talent is not key here - community is. The bar for this has been *much* higher core than contrib in the past. And contrib has had different bars over time - I think it was even lower in the past at points.

Agreed.

>> I think sometimes we forget that you can do svn revert.
>
> I hate to have to do that. I don't think it's a great way to handle this - we could make everyone a committer at a drop of a hat and say we can just revert. I wouldn't call for a revert except in exceptional circumstances. I don't think that's the point.

Right, obviously I wasn't implying we'd want to do it, but we can if it is absolutely necessary.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845432#action_12845432 ]

Michael McCandless commented on LUCENE-2312:

I don't see anything obviously wrong -- you excised this code from the same code that's used when merging the postings during flush?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845428#action_12845428 ]

Michael McCandless commented on LUCENE-2324:

{quote}
bq. Seems like it's 8 bytes

Object header is two words, so that's 16 bytes for 64bit arch. (probably 12 for 64bit+CompressedOops?)
{quote}

Right, and the pointer'd also be 8 bytes (but compact int stays at 4 bytes) so net/net on 64bit JRE savings would be 16-20 bytes per term.

Another thing we could do if we cutover to parallel arrays is to switch to packed ints. Many of these fields are horribly wasteful as ints, eg docFreq or lastPosition.

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
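As a rough illustration of the packed-ints idea mentioned above, the following sketch stores fixed-width values back to back in a long[] instead of spending 32 bits on each. It is a self-contained toy, not Lucene's packed ints implementation; the class name `PackedIntsSketch` and its methods are made up for this example.

```java
/** Fixed-width unsigned values packed back to back into a long[] (hypothetical sketch). */
final class PackedIntsSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedIntsSketch(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..32");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) valueCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    void set(int index, long value) {
        if ((value & ~mask) != 0) {
            throw new IllegalArgumentException("value does not fit in " + bitsPerValue + " bits");
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        // Low part goes into the current block.
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        // If the value straddles a block boundary, the high part spills into the next block.
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | (value >>> (bitsPerValue - spill));
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return value & mask;
    }

    /** RAM used by the packed payload, ignoring the array header. */
    long ramBytes() {
        return 8L * blocks.length;
    }
}
```

For 100 values that each fit in 13 bits this uses 21 longs (168 bytes) where an int[] payload would take 400 bytes; the price is the shift-and-mask work on every access.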
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845426#action_12845426 ]

Michael McCandless commented on LUCENE-2324:

bq. Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

Or 2 sets of parallel arrays (one for the singletons) or, something.

bq. So in a 32Bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it?

And also the GC cost. But it seems like specializing singleton fields will be the bigger win.

bq. I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code?

Last I tested (a while back now) indexing perf was the same -- need to test again w/ recent changes (eg terms index is switching to packed ints).

For pooling vs not I'd just do the experiment on trunk? And most of this change (changing how postings data is buffered in RAM) is "above" flex I expect. But if for some reason you need to start changing index postings format then you should probably do that on flex.
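The "use one bit of the dense ID" suggestion quoted above could look roughly like this. It is a sketch under assumptions: the flag is placed in the low bit (mirroring how the postings doc code carries a freq==1 flag), and `FlaggedTermIdSketch` and its method names are hypothetical.

```java
/** Packs a dense term ID and a "seen more than once" flag into a single int (hypothetical sketch). */
final class FlaggedTermIdSketch {
    /** Largest dense ID that still leaves room for the flag bit. */
    static final int MAX_ID = (1 << 30) - 1;

    static int encode(int denseId, boolean seenMoreThanOnce) {
        if (denseId < 0 || denseId > MAX_ID) {
            throw new IllegalArgumentException("denseId out of range: " + denseId);
        }
        return (denseId << 1) | (seenMoreThanOnce ? 1 : 0);
    }

    static int denseId(int encoded) {
        return encoded >>> 1;
    }

    static boolean seenMoreThanOnce(int encoded) {
        return (encoded & 1) != 0;
    }

    /** Promotes a singleton to the "full posting" state without changing its ID. */
    static int markSeenAgain(int encoded) {
        return encoded | 1;
    }
}
```

The trade-off is that the usable ID space is halved and every array lookup needs a shift first, in exchange for avoiding a second hash or an instanceof check to tell singletons from full postings.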
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On 03/15/2010 08:33 AM, Grant Ingersoll wrote:

> Right, Mark. I think we would be effectively raising the bar to some extent for what it takes to be a committer.

That's part of my point though - some are contrib committers with a lower bar - now they are core/solr committers with that lower bar, but someone else that came along would not get to the same position now?

> We'd also be making contrib a first class citizen (not that it ever wasn't, but some people have that perception).

I think because it was kind of true. I could come along before and donate contrib x, and never show I worked well with the community or build up the merit needed to be a committer, and be made a contrib committer simply to maintain my module. That's happened plenty.

> Finally, I think we need to recognize that not everyone needs to be a McCandless in order to contribute in a helpful way.

We obviously recognize that or else I wouldn't be here! I think it's more about fitting in - showing you get and follow the Apache way. Showing that ideas and changes you might push are in line with what the other committers think is appropriate of a core/solr committer. Talent is not key here - community is. The bar for this has been *much* higher core than contrib in the past. And contrib has had different bars over time - I think it was even lower in the past at points.

> I think sometimes we forget that you can do svn revert.

I hate to have to do that. I don't think it's a great way to handle this - we could make everyone a committer at a drop of a hat and say we can just revert. I wouldn't call for a revert except in exceptional circumstances. I don't think that's the point.

> Obviously, we don't want to have to do it often, but it's not a huge deal if it happens. We've all been there.
>
> -Grant

I also wouldn't personally cast my vote on this broadly - some people I might think should be core/solr committers now, others not. Merit at Apache is important - you never lose it. Seems weird to get something like that so easily when in the past you had to work your way to it from contrib committership and get voted on individually by the PMC.

Personally I'd prefer we just stop adding them, and the current ones work their way up like normal if they are so inclined, or the ones that are not even around anymore can just stay as they are.

--
- Mark

http://www.lucidimagination.com
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845408#action_12845408 ]

Earwin Burrfoot commented on LUCENE-2324:

> Seems like it's 8 bytes

Object header is two words, so that's 16 bytes for 64bit arch. (probably 12 for 64bit+CompressedOops?)

Also, GC time is (roughly) linear in the number of objects on the heap, so replacing a single huge array of objects with a few huge primitive arrays for their fields does miracles for your GC delays.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845404#action_12845404 ]

Jason Rutherglen commented on LUCENE-2312:

Advance apology for permanently damaging (well, I guess it can be deleted) the look and feel of this issue with a thwack of code; however, I don't want to post the messy patch, and I'm guessing there's some small reason why the postings iteration on the freq byte slice reader isn't happening correctly (ie, it's returning 0).

{code}
public class DWTermDocs implements TermDocs {
  final FreqProxTermsWriterPerField field;
  final int numPostings;
  final CharBlockPool charPool;
  FreqProxTermsWriter.PostingList posting;
  char[] text;
  int textOffset;
  private int postingUpto = -1;
  final ByteSliceReader freq = new ByteSliceReader();
  final ByteSliceReader prox = new ByteSliceReader();
  int docID;
  int termFreq;

  DWTermDocs(FreqProxTermsWriterPerField field, FreqProxTermsWriter.PostingList posting) throws IOException {
    this.field = field;
    this.charPool = field.perThread.termsHashPerThread.charPool;
    //this.numPostings = field.termsHashPerField.numPostings;
    this.numPostings = 1;
    this.posting = posting;
    // nextTerm is called only once to
    // set the term docs pointer at the
    // correct position
    nextTerm();
  }

  boolean nextTerm() throws IOException {
    postingUpto++;
    if (postingUpto == numPostings)
      return false;

    docID = 0;

    text = charPool.buffers[posting.textStart >> DocumentsWriter.CHAR_BLOCK_SHIFT];
    textOffset = posting.textStart & DocumentsWriter.CHAR_BLOCK_MASK;

    field.termsHashPerField.initReader(freq, posting, 0);
    if (!field.fieldInfo.omitTermFreqAndPositions)
      field.termsHashPerField.initReader(prox, posting, 1);

    // Should always be true
    boolean result = nextDoc();
    assert result;

    return true;
  }

  public boolean nextDoc() throws IOException {
    if (freq.eof()) {
      if (posting.lastDocCode != -1) {
        // Return last doc
        docID = posting.lastDocID;
        if (!field.omitTermFreqAndPositions)
          termFreq = posting.docFreq;
        posting.lastDocCode = -1;
        return true;
      } else
        // EOF
        return false;
    }

    final int code = freq.readVInt();
    if (field.omitTermFreqAndPositions)
      docID += code;
    else {
      docID += code >>> 1;
      if ((code & 1) != 0)
        termFreq = 1;
      else
        termFreq = freq.readVInt();
    }

    assert docID != posting.lastDocID;

    return true;
  }
{code}
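For readers unfamiliar with the `code >>> 1` / `code & 1` decoding in nextDoc() above, here is a small self-contained sketch of that encoding: each entry is a VInt holding the docID delta shifted left by one, with the low bit set when the term frequency is exactly 1 (so no separate frequency follows). `DocCodeSketch` is a hypothetical class that works on a plain byte array rather than Lucene's byte slices, and unlike the snippet above it writes every document to the stream instead of holding the most recent one back in lastDocID/lastDocCode.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class DocCodeSketch {
    /** Writes a non-negative int as a variable-length sequence of 7-bit groups, low group first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one VInt starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] bytes, int[] pos) {
        byte b = bytes[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes ascending docIDs with their term frequencies as delta codes. */
    static byte[] encode(int[] docIDs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDocID = 0;
        for (int i = 0; i < docIDs.length; i++) {
            int delta = docIDs[i] - lastDocID;
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);   // low bit set: freq is 1, nothing follows
            } else {
                writeVInt(out, delta << 1);         // low bit clear: freq follows as its own VInt
                writeVInt(out, freqs[i]);
            }
            lastDocID = docIDs[i];
        }
        return out.toByteArray();
    }

    /** Decodes back into {docID, termFreq} pairs, mirroring the loop body of nextDoc(). */
    static List<int[]> decode(byte[] bytes) {
        List<int[]> result = new ArrayList<>();
        int[] pos = {0};
        int docID = 0;
        while (pos[0] < bytes.length) {
            int code = readVInt(bytes, pos);
            docID += code >>> 1;
            int termFreq = (code & 1) != 0 ? 1 : readVInt(bytes, pos);
            result.add(new int[] {docID, termFreq});
        }
        return result;
    }
}
```

A reader that only ever sees an empty stream for a term with a single document is consistent with the held-back last document: in the posted snippet that document is served from `posting.lastDocID` once `freq.eof()` is true.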
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845400#action_12845400 ]

Michael Busch commented on LUCENE-2324:

{quote}
Sounds great - let's test it in practice.
{quote}

I have to admit that I need to catch up a bit on the flex branch. I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code? Is it as fast as trunk already, or are there related nocommits left that affect indexing performance? I would think not much of the flex changes should affect the in-memory indexing performance (in TermsHash*).
[jira] Issue Comment Edited: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398 ]

Michael Busch edited comment on LUCENE-2324 at 3/15/10 4:34 PM:

Reply to Mike's comment on LUCENE-2293: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263

{quote}
I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed). EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting.
{quote}

Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

{quote}
I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...). And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs).

Though, the price you pay is for dereferencing multiple times for each array?

And how much RAM would we save? The pointer for the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java? Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32Bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398 ]

Michael Busch commented on LUCENE-2324:

{quote}
I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed). EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting.
{quote}

Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

{quote}
I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...). And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs).

Though, the price you pay is for dereferencing multiple times for each array?

And how much RAM would we save? The pointer for the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java? Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32Bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it?
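A minimal sketch of the parallel-arrays layout discussed in this comment, assuming just two of the PostingList members for brevity. `ParallelPostingsSketch`, `newPosting` and `recordOccurrence` are hypothetical names, and the bookkeeping in `recordOccurrence` is only there to show the access pattern, not how Lucene maintains these fields. The helper at the end reproduces the per-term savings arithmetic from the thread.

```java
import java.util.Arrays;

/** One primitive array per former PostingList member, indexed by a dense posting ID. */
final class ParallelPostingsSketch {
    int[] docFreqs = new int[4];
    int[] lastDocIDs = new int[4];
    private int count;

    /** Assigns the next dense ID (0, 1, 2, ...), growing every column together when full. */
    int newPosting() {
        if (count == docFreqs.length) {
            int newSize = count * 2;
            docFreqs = Arrays.copyOf(docFreqs, newSize);
            lastDocIDs = Arrays.copyOf(lastDocIDs, newSize);
        }
        return count++;
    }

    /** Hypothetical bookkeeping: one array access per member, no object dereference. */
    void recordOccurrence(int postingID, int docID) {
        docFreqs[postingID]++;
        lastDocIDs[postingID] = docID;
    }

    /** Bytes saved per term: the pointer and object header go away, a 4-byte int ID is added. */
    static int bytesSavedPerTerm(int pointerBytes, int headerBytes) {
        return pointerBytes + headerBytes - Integer.BYTES;
    }
}
```

With a 4-byte pointer and an 8-byte header the helper gives 8 bytes, matching the 32-bit estimate above; with an 8-byte pointer and a 12- or 16-byte header it gives 16 or 20, the range quoted elsewhere in the thread for a 64-bit JRE.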
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845391#action_12845391 ] Michael Busch commented on LUCENE-2293: --- I'll reply on LUCENE-2324. > IndexWriter has hard limit on max concurrency > - > > Key: LUCENE-2293 > URL: https://issues.apache.org/jira/browse/LUCENE-2293 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2293.patch > > > DocumentsWriter has this nasty hardwired constant: > {code} > private final static int MAX_THREAD_STATE = 5; > {code} > which probably I should have attached a //nocommit to the moment I > wrote it ;) > That constant sets the max number of thread states to 5. This means, > if more than 5 threads enter IndexWriter at once, they will "share" > only 5 thread states, meaning we gate CPU concurrency to 5 running > threads inside IW (each thread must first wait for the last thread to > finish using the thread state before grabbing it). > This is bad because modern hardware can make use of more than 5 > threads. So I think an immediate fix is to make this settable > (expert), and increase the default (8?). > It's tricky, though, because the more thread states, the less RAM > efficiency you have, meaning the worse indexing throughput. So you > shouldn't up and set this to 50: you'll be flushing too often. > But... I think a better fix is to re-think how threads write state > into DocumentsWriter. Today, a single docID stream is assigned across > threads (eg one thread gets docID=0, next one docID=1, etc.), and each > thread writes to a private RAM buffer (living in the thread state), > and then on flush we do a merge sort. The merge sort is inefficient > (does not currently use a PQ)... and, wasteful because we must > re-decode every posting byte. 
> I think we could change this, so that threads write to private RAM > buffers, with a private docID stream, but then instead of merging on > flush, we directly flush each thread as its own segment (and, allocate > private docIDs to each thread). We can then leave merging to CMS > which can already run merges in the BG without blocking ongoing > indexing (unlike the merge we do in flush, today). > This would also allow us to separately flush thread states. Ie, we > need not flush all thread states at once -- we can flush one when it > gets too big, and then let the others keep running. This should be a > good concurrency gain since it uses IO & CPU resources "throughout" > indexing instead of "big burst of CPU only" then "big burst of IO > only" that we have today (flush today "stops the world"). > One downside I can think of is... docIDs would now be "less > monotonic", meaning if N threads are indexing, you'll roughly get > in-time-order assignment of docIDs. But with this change, all of one > thread state would get 0..N docIDs, the next thread state'd get > N+1...M docIDs, etc. However, a single thread would still get > monotonic assignment of docIDs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
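The proposal in the LUCENE-2293 description above (private RAM buffers with private docID streams, each flushed as its own segment when it gets too big) can be illustrated with a small single-threaded simulation. This is a sketch under assumptions, not DocumentsWriter code: PrivateSegmentBuffers, ThreadState, addDocument and the byte threshold are hypothetical names, a "segment" is reduced to its document count, and no real locking or IO is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation: each thread state owns a private buffer and a
// private docID stream, and is flushed on its own once it grows too big.
final class PrivateSegmentBuffers {
    static final class ThreadState {
        int nextDocID = 0;   // private docID stream, restarts with every new segment
        long bytesUsed = 0;  // RAM consumed by this state's private buffer
    }

    private final ThreadState[] states;
    private final long flushThresholdBytes;
    // One entry per flushed segment; the value is that segment's doc count.
    private final List<Integer> flushedSegmentDocCounts = new ArrayList<>();

    PrivateSegmentBuffers(int numStates, long flushThresholdBytes) {
        this.states = new ThreadState[numStates];
        for (int i = 0; i < numStates; i++) {
            states[i] = new ThreadState();
        }
        this.flushThresholdBytes = flushThresholdBytes;
    }

    /** Buffers one document in the given state and returns its segment-private docID. */
    int addDocument(int stateIndex, long docBytes) {
        ThreadState s = states[stateIndex];
        int docID = s.nextDocID++;
        s.bytesUsed += docBytes;
        if (s.bytesUsed >= flushThresholdBytes) {
            flush(s);  // only this state is flushed; the others keep buffering
        }
        return docID;
    }

    private void flush(ThreadState s) {
        flushedSegmentDocCounts.add(s.nextDocID);
        s.nextDocID = 0;
        s.bytesUsed = 0;
    }

    List<Integer> flushedSegmentDocCounts() { return flushedSegmentDocCounts; }
}
```

Because no docID is shared across states, nothing has to be merged or re-decoded at flush time; combining the resulting segments is left to normal background merging, as the description suggests.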
Re: Welcome new committers!
Welcome guys! :) Sounds really like some great progress in such a short time! Michael On 3/15/10 8:25 AM, Michael McCandless wrote: The merge of Solr and Lucene dev is well underway... Lucene already has a bunch of new committers... welcome aboard! And overnight tons of work was done (and beer, espresso and tea, depending on your timezone, consumed ;) and now we already have a branch where Solr has been upgraded to Lucene's trunk JARs: https://svn.apache.org/repos/asf/lucene/solr/branches/solr Wonderfully, this then made testing the flex branch against Solr simple, which Robert did, thus uncovering a couple back-compat issues that otherwise would've remained hidden... Great progress already! Of course there's still much to do going forward...devil is in the details, but it's great to have us all on the same team. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2311) Pass potent SR to IRWarmer.warm(), and also call warm() for new segments
[ https://issues.apache.org/jira/browse/LUCENE-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2311: -- Assignee: Michael McCandless > Pass potent SR to IRWarmer.warm(), and also call warm() for new segments > > > Key: LUCENE-2311 > URL: https://issues.apache.org/jira/browse/LUCENE-2311 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Earwin Burrfoot >Assignee: Michael McCandless > Fix For: 3.1 > > > Currently warm() receives a SegmentReader without terms index and docstores. > It would be arguably more useful for the app to receive a fully loaded > reader, so it can actually fire up some caches. If the warmer is undefined on > IW, we probably leave things as they are. > It is also arguably more concise and clear to call warm() on all newly > created segments, so there is a single point of warming readers in NRT > context, and every subreader coming from getReader is guaranteed to be warmed > up -> you don't have to introduce even more mess in your code by rechecking > it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845374#action_12845374 ] Jason Rutherglen commented on LUCENE-2312: -- {quote}Good question on skipping - for first cut we can have no skipping (and just scan)? {quote} True. One immediate thought is to have a set skip interval (what was it before when we had single level?), and for now at least have a single level skip list. That way we can grow the posting list with docs, and the skip list at the same time. If the interval is constant there won't be a need to rebuild the skip list. > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer users near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Today's Lucene-based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. 
Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
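The single-level skip list with a constant interval suggested in the LUCENE-2312 comment above can be sketched as follows. This is a toy under assumptions: GrowablePostings, add and advance are hypothetical names, postings are plain doc IDs in an int[] rather than Lucene's byte-encoded format, and the concurrent reader/writer visibility question raised elsewhere in the thread is not addressed.

```java
import java.util.Arrays;

// Hypothetical sketch: a posting list and its single-level skip list grow
// together. With a constant interval, existing skip entries never change.
final class GrowablePostings {
    private final int skipInterval;
    private int[] docIDs = new int[8];
    private int[] skipDocs = new int[2];  // skipDocs[k] == docIDs[k * skipInterval]
    private int size = 0;

    GrowablePostings(int skipInterval) {
        this.skipInterval = skipInterval;
    }

    /** Appends a doc ID; callers must add doc IDs in increasing order. */
    void add(int docID) {
        if (size == docIDs.length) {
            docIDs = Arrays.copyOf(docIDs, size * 2);
        }
        if (size % skipInterval == 0) {
            // Every skipInterval-th posting also gets a skip entry.
            int slot = size / skipInterval;
            if (slot == skipDocs.length) {
                skipDocs = Arrays.copyOf(skipDocs, slot * 2);
            }
            skipDocs[slot] = docID;
        }
        docIDs[size++] = docID;
    }

    /** Returns the first doc ID >= target, or -1 if there is none. */
    int advance(int target) {
        int numSkips = (size + skipInterval - 1) / skipInterval;
        int block = 0;
        // Move to the last block whose first doc ID is still <= target.
        while (block + 1 < numSkips && skipDocs[block + 1] <= target) {
            block++;
        }
        // Scan linearly from the start of that block.
        for (int i = block * skipInterval; i < size; i++) {
            if (docIDs[i] >= target) {
                return docIDs[i];
            }
        }
        return -1;
    }
}
```

Appending a document touches at most one new skip slot, so the skip list never has to be rebuilt as the posting list grows, which is the property the comment points out.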
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845370#action_12845370 ] Michael McCandless commented on LUCENE-2320: bq. Or, maybe, we should think of MergePolicy API that doesn't require one to keep a reference to IW? Looks like IW is used pretty widely: for messaging (when infoStream is set), for retrieving the merges, for getting the Directory, and for getting number of deleted docs for a given segment. I guess an option would be to simply pass it around everywhere. Then we wouldn't have to break the circular dependency. This is what MergeScheduler appears to do -- it's passed to .merge, and then each bg thread in CMS holds a reference to the writer (since it needs to ask for follow-on merges). > Add MergePolicy to IndexWriterConfig > > > Key: LUCENE-2320 > URL: https://issues.apache.org/jira/browse/LUCENE-2320 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2320.patch > > > Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as > well. The change is not straightforward and so I've kept it for a separate > issue. MergePolicy requires in its ctor an IndexWriter, however none can be > passed to it before an IndexWriter actually exists. And today IW may create > an MP just for it to be overridden by the application one line afterwards. I > don't want to make iw member of MP non-final, or settable by extending > classes, however it needs to remain protected so they can access it directly. > So the proposed changes are: > * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set > once (hence its name). It'll have the signature SetOnce w/ *synchronized > set* and *T get()*. T will be declared volatile, so that get() won't be > synchronized. 
> * MP will define a *protected final SetOnce writer* instead of > the current writer. *NOTE: this is a bw break*. any suggestions are welcomed. > * MP will offer a public default ctor, together with a set(IndexWriter). > * IndexWriter will set itself on MP using set(this). Note that if set will be > called more than once, it will throw an exception (AlreadySetException - or > does someone have a better suggestion, preferably an already existing Java > exception?). > That's the core idea. I'd like to post a patch soon, so I'd appreciate your > review and proposals. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
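The SetOnce holder described in the LUCENE-2320 proposal above might be sketched like this. It is an illustration of the stated design (synchronized set, unsynchronized get on a volatile field, an exception on a second set), not the attached patch; the generic parameter, the nested AlreadySetException and the exception message are assumptions made for the example.

```java
// Hypothetical sketch of the proposed holder: the value can be set exactly
// once, and reads go through a volatile field without locking.
final class SetOnce<T> {
    /** Exception name taken from the proposal; its shape here is an assumption. */
    static final class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice!");
        }
    }

    private volatile T value;       // volatile, so get() needs no synchronization
    private boolean isSet = false;  // guarded by "this"

    /** Sets the value; a second call throws AlreadySetException. */
    synchronized void set(T newValue) {
        if (isSet) {
            throw new AlreadySetException();
        }
        value = newValue;
        isSet = true;
    }

    /** Returns the value, or null if it has not been set yet. */
    T get() {
        return value;
    }
}
```

Under this sketch a MergePolicy would hold a final SetOnce field for its writer, and IndexWriter would call set(this) on it once, so the field stays final while the reference is supplied after construction.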
Welcome new committers!
The merge of Solr and Lucene dev is well underway... Lucene already has a bunch of new committers... welcome aboard! And overnight tons of work was done (and beer, espresso and tea, depending on your timezone, consumed ;) and now we already have a branch where Solr has been upgraded to Lucene's trunk JARs: https://svn.apache.org/repos/asf/lucene/solr/branches/solr Wonderfully, this then made testing the flex branch against Solr simple, which Robert did, thus uncovering a couple back-compat issues that otherwise would've remained hidden... Great progress already! Of course there's still much to do going forward...devil is in the details, but it's great to have us all on the same team. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling
[ https://issues.apache.org/jira/browse/LUCENE-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2297: --- Attachment: LUCENE-2297.patch Adds IWC.set/getReaderPooling. > IndexWriter should let you optionally enable reader pooling > --- > > Key: LUCENE-2297 > URL: https://issues.apache.org/jira/browse/LUCENE-2297 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2297.patch > > > For apps that use a large index and frequently need to commit and resolve > deletes, the cost of opening the SegmentReaders on demand for every commit > can be prohibitive. > We can already pool readers (NRT does so), but, we only turn it on if NRT > readers are in use. > We should allow separate control. > We should do this after LUCENE-2294. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2320: -- Assignee: Michael McCandless > Add MergePolicy to IndexWriterConfig > > > Key: LUCENE-2320 > URL: https://issues.apache.org/jira/browse/LUCENE-2320 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2320.patch > > > Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as > well. The change is not straightforward and so I've kept it for a separate > issue. MergePolicy requires in its ctor an IndexWriter, however none can be > passed to it before an IndexWriter actually exists. And today IW may create > an MP just for it to be overridden by the application one line afterwards. I > don't want to make iw member of MP non-final, or settable by extending > classes, however it needs to remain protected so they can access it directly. > So the proposed changes are: > * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set > once (hence its name). It'll have the signature SetOnce w/ *synchronized > set* and *T get()*. T will be declared volatile, so that get() won't be > synchronized. > * MP will define a *protected final SetOnce writer* instead of > the current writer. *NOTE: this is a bw break*. any suggestions are welcomed. > * MP will offer a public default ctor, together with a set(IndexWriter). > * IndexWriter will set itself on MP using set(this). Note that if set will be > called more than once, it will throw an exception (AlreadySetException - or > does someone have a better suggestion, preferably an already existing Java > exception?). > That's the core idea. I'd like to post a patch soon, so I'd appreciate your > review and proposals. -- This message is automatically generated by JIRA. 
[jira] Commented: (LUCENE-2325) investigate solr test failures using flex
[ https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845332#action_12845332 ] Michael McCandless commented on LUCENE-2325: So awesome that we are at the point where we can do this! Thanks Robert... > investigate solr test failures using flex > - > > Key: LUCENE-2325 > URL: https://issues.apache.org/jira/browse/LUCENE-2325 > Project: Lucene - Java > Issue Type: Test >Affects Versions: Flex Branch >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: Flex Branch > > Attachments: LUCENE-2325.patch > > > We have a branch of Solr located here: > https://svn.apache.org/repos/asf/lucene/solr/branches/solr > Currently all the tests pass with lucene trunk jars. > I plopped in the flex jars and they do not, so I thought these might be > interesting to look at. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2325) investigate solr test failures using flex
[ https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2325: -- Assignee: Michael McCandless > investigate solr test failures using flex > - > > Key: LUCENE-2325 > URL: https://issues.apache.org/jira/browse/LUCENE-2325 > Project: Lucene - Java > Issue Type: Test >Affects Versions: Flex Branch >Reporter: Robert Muir >Assignee: Michael McCandless > Fix For: Flex Branch > > Attachments: LUCENE-2325.patch > > > We have a branch of Solr located here: > https://svn.apache.org/repos/asf/lucene/solr/branches/solr > Currently all the tests pass with lucene trunk jars. > I plopped in the flex jars and they do not, so I thought these might be > interesting to look at. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: How can I use QueryScorer() to find only perfect matches??
Try +contents:term +contents:query. By misplacing the '+' you're getting the default OR operator and the '+' is probably being thrown away by the analyzer. Luke will help here a lot. HTH Erick On Mon, Mar 15, 2010 at 9:46 AM, christian stadler wrote: > Hi there, > > I have an issue with the QueryScorer(query) method at the moment and I need > some assistance. > I was indexing my e-book "lucene in action" and based on this index-db I > started to play around with some boolean queries like: > (contents:+term contents:+query) > As a result I'm expecting as a perfect match for the phrase "term query" > four > hits. > > But when I run my sample to highlight this phrase in the context then I get > a > lot more results. It also finds all the matches for "term" and "query" > independently. > > I think the problem is the QueryScorer() which softens the former exact > boolean > query. > Then I was trying the following: > private static Highlighter GetHits(Query query, Formatter formatter) > { >string filed = "contents" >BooleanQuery termsQuery = new BooleanQuery(); > >WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field); >foreach (WeightedTerm term in terms) >{ >TermQuery termQuery = new TermQuery(new Term(field, > term.GetTerm())); >termsQuery.Add(termQuery, BooleanClause.Occur.MUST); >} > >// create query scorer based on term queries (field specific) >QueryScorer scorer = new QueryScorer(termsQuery); > >Highlighter highlighter = new Highlighter(formatter, scorer); >highlighter.SetTextFragmenter(new SimpleFragmenter(20)); > >return highlighter; > } > to rewrite the query and set the term attribute from SHOULD to MUST > > But the result was the same. > Do you have any example how I can use the QueryScorer() in exactly the same > way > as to mimic a BooleanSearch?? > > thanks in advance > Christian > > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >
[jira] Resolved: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2293. Resolution: Fixed > IndexWriter has hard limit on max concurrency > - > > Key: LUCENE-2293 > URL: https://issues.apache.org/jira/browse/LUCENE-2293 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2293.patch > > > DocumentsWriter has this nasty hardwired constant: > {code} > private final static int MAX_THREAD_STATE = 5; > {code} > which probably I should have attached a //nocommit to the moment I > wrote it ;) > That constant sets the max number of thread states to 5. This means, > if more than 5 threads enter IndexWriter at once, they will "share" > only 5 thread states, meaning we gate CPU concurrency to 5 running > threads inside IW (each thread must first wait for the last thread to > finish using the thread state before grabbing it). > This is bad because modern hardware can make use of more than 5 > threads. So I think an immediate fix is to make this settable > (expert), and increase the default (8?). > It's tricky, though, because the more thread states, the less RAM > efficiency you have, meaning the worse indexing throughput. So you > shouldn't up and set this to 50: you'll be flushing too often. > But... I think a better fix is to re-think how threads write state > into DocumentsWriter. Today, a single docID stream is assigned across > threads (eg one thread gets docID=0, next one docID=1, etc.), and each > thread writes to a private RAM buffer (living in the thread state), > and then on flush we do a merge sort. The merge sort is inefficient > (does not currently use a PQ)... and, wasteful because we must > re-decode every posting byte. 
> I think we could change this, so that threads write to private RAM > buffers, with a private docID stream, but then instead of merging on > flush, we directly flush each thread as its own segment (and, allocate > private docIDs to each thread). We can then leave merging to CMS > which can already run merges in the BG without blocking ongoing > indexing (unlike the merge we do in flush, today). > This would also allow us to separately flush thread states. Ie, we > need not flush all thread states at once -- we can flush one when it > gets too big, and then let the others keep running. This should be a > good concurrency gain since it uses IO & CPU resources "throughout" > indexing instead of "big burst of CPU only" then "big burst of IO > only" that we have today (flush today "stops the world"). > One downside I can think of is... docIDs would now be "less > monotonic", meaning if N threads are indexing, you'll roughly get > in-time-order assignment of docIDs. But with this change, all of one > thread state would get 0..N docIDs, the next thread state'd get > N+1...M docIDs, etc. However, a single thread would still get > monotonic assignment of docIDs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2325) investigate solr test failures using flex
[ https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2325: Attachment: LUCENE-2325.patch attached is a very small patch to the Solr branch so it will compile against flex jars. > investigate solr test failures using flex > - > > Key: LUCENE-2325 > URL: https://issues.apache.org/jira/browse/LUCENE-2325 > Project: Lucene - Java > Issue Type: Test >Affects Versions: Flex Branch >Reporter: Robert Muir > Fix For: Flex Branch > > Attachments: LUCENE-2325.patch > > > We have a branch of Solr located here: > https://svn.apache.org/repos/asf/lucene/solr/branches/solr > Currently all the tests pass with lucene trunk jars. > I plopped in the flex jars and they do not, so I thought these might be > interesting to look at. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2325) investigate solr test failures using flex
investigate solr test failures using flex - Key: LUCENE-2325 URL: https://issues.apache.org/jira/browse/LUCENE-2325 Project: Lucene - Java Issue Type: Test Affects Versions: Flex Branch Reporter: Robert Muir Fix For: Flex Branch Attachments: LUCENE-2325.patch We have a branch of Solr located here: https://svn.apache.org/repos/asf/lucene/solr/branches/solr Currently all the tests pass with lucene trunk jars. I plopped in the flex jars and they do not, so I thought these might be interesting to look at. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
How can I use QueryScorer() to find only perfect matches??
Hi there,

I have an issue with the QueryScorer(query) method at the moment and I need some assistance. I indexed my e-book "lucene in action" and, based on this index, started to play around with some boolean queries like:

(contents:+term contents:+query)

As a result I'm expecting four hits as perfect matches for the phrase "term query".

But when I run my sample to highlight this phrase in context, I get a lot more results. It also finds all the matches for "term" and "query" independently.

I think the problem is the QueryScorer(), which softens the former exact boolean query. Then I tried the following:

private static Highlighter GetHits(Query query, Formatter formatter)
{
    string field = "contents";
    BooleanQuery termsQuery = new BooleanQuery();

    WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
    foreach (WeightedTerm term in terms)
    {
        TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
        termsQuery.Add(termQuery, BooleanClause.Occur.MUST);
    }

    // create query scorer based on term queries (field specific)
    QueryScorer scorer = new QueryScorer(termsQuery);

    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(20));

    return highlighter;
}

to rewrite the query and set the term attribute from SHOULD to MUST.

But the result was the same. Do you have any example of how I can use the QueryScorer() in exactly the same way, so as to mimic a BooleanSearch??

thanks in advance
Christian

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Baby steps towards making Lucene's scoring more flexible...
>>> But I don't like baking in search concepts at index time... >> > Many scoring models are possible if you store enough stats in the > index. > in general the missing stats seem to fit in two buckets/categories: 1) length normalization pivot: average length in bytes, terms, unique terms 2) term frequency normalization factor: max or average tf for the field. you never need more than one of each category for the same field. one approach would be for the search-time similarity to simply use these generic names (i guess they could get some placeholder value if they are not available) and at index time, you make sure you put the one you want (or none at all) in the "bucket" -- Robert Muir rcm...@gmail.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
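One way to read the two "buckets" described in the message above is as a pair of generically named per-field statistics that a search-time similarity consumes, falling back to a default when a bucket was left empty at index time. The sketch below is only an illustration of that reading: GenericStatsSimilarity, UNAVAILABLE and the slope parameter are hypothetical, the pivoted length normalization and the max-tf scaling are standard textbook formulas chosen for the example rather than anything proposed in the thread, and the fallbacks mirror the familiar 1/sqrt(length) and sqrt(freq) defaults.

```java
// Hypothetical sketch: a search-time similarity that reads two generic
// per-field stats, whatever the indexer chose to put in each bucket.
final class GenericStatsSimilarity {
    /** Placeholder for a bucket that was not filled at index time. */
    static final float UNAVAILABLE = -1f;

    private final float lengthNormPivot;  // bucket 1: e.g. average field length
    private final float tfNormFactor;     // bucket 2: e.g. max or average tf
    private final float slope;            // assumed tuning knob for the pivot

    GenericStatsSimilarity(float lengthNormPivot, float tfNormFactor, float slope) {
        this.lengthNormPivot = lengthNormPivot;
        this.tfNormFactor = tfNormFactor;
        this.slope = slope;
    }

    /** Pivoted length normalization, or 1/sqrt(length) without a pivot. */
    float lengthNorm(int docLength) {
        if (lengthNormPivot == UNAVAILABLE) {
            return (float) (1.0 / Math.sqrt(docLength));
        }
        return 1f / ((1f - slope) * lengthNormPivot + slope * docLength);
    }

    /** Term frequency scaled by the stored factor, or sqrt(freq) without one. */
    float tf(int freq) {
        if (tfNormFactor == UNAVAILABLE) {
            return (float) Math.sqrt(freq);
        }
        return 0.5f + 0.5f * freq / tfNormFactor;
    }
}
```

The similarity never needs to know whether the pivot was measured in bytes, terms or unique terms; that decision is made once at index time by whoever fills the bucket.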
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On Mar 14, 2010, at 6:47 PM, Mark Miller wrote: > > > On 03/14/2010 06:37 PM, Grant Ingersoll wrote: >> On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote: >> >> >>> This time a +1 without discuss :-) >>> >> Yeah, but Uwe, the thread was DISCUSS, not VOTE! :-) >> > > I had a whole spiel about earning merit, and some contrib committers were > made contrib committers for just a single contrib, some long ago, didn't have > to necessarily show they understood/followed the apache way, lower bar (not > necessarily from talent perspective, but you might be made a contrib > committer just to maintain the code module you contributed, whether you > worked with the community or not), etc, etc. But ah, since everyone is into > it without discussion, far be it from me to stand against. And I got my spiel > in (super condensed) anyway now. With everyone else into it so far, I just > look foolish trying to discuss :) Right, Mark. I think we would be effectively raising the bar to some extent for what it takes to be a committer. We'd also be making contrib a first class citizen (not that it ever wasn't, but some people have that perception). Finally, I think we need to recognize that not everyone needs to be a McCandless in order to contribute in a helpful way. I think sometimes we forget that you can do svn revert. Obviously, we don't want to have to do it often, but it's not a huge deal if it happens. We've all been there. -Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On Mar 14, 2010, at 8:25 PM, Yonik Seeley wrote: > On Sun, Mar 14, 2010 at 5:47 PM, Mark Miller wrote: >> On 03/14/2010 06:37 PM, Grant Ingersoll wrote: >>> >>> On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote: >>> >>> This time a +1 without discuss :-) >>> >>> Yeah, but Uwe, the thread was DISCUSS, not VOTE! :-) >>> >> >> I had a whole spiel about earning merit, and some contrib committers were >> made contrib committers for just a single contrib, some long ago, didn't >> have to necessarily show they understood/followed the apache way, lower bar >> (not necessarily from talent perspective, but you might be made a contrib >> committer just to maintain the code module you contributed, whether you >> worked with the community or not), etc, etc. > > Hmmm, yeah - when it is time to VOTE, there are actually two different > questions here: Agreed. > 1) if lucene should move away from contrib committers, adding no new ones Yes, this is what I'm thinking. All future committers would be based on contributions to the project and there would be no distinction between contrib/core. > 2) if all existing contrib committers should immediately become core > lucene/solr committers, or if that promotion should proceed in the > normal fashion as it has in the past. I'm fine w/ all of them, except we might want to check to see if it has been more than a year of contributing and ask any of them if they want to be Emeritus. -Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 15, 2010 at 12:03 AM, Marvin Humphrey wrote: > On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote: > >> I still don't think similarity should have any bearing during indexing. > > Similarity has always, from day one, affected the contents of the index. This > idea that it should be totally divorced from indexing is, in fact, a very > significant change that you are proposing for Lucene, and it will require > non-trivial changes to the file format. I agree. Instead of storing byte per doc I'm proposing storing the raw stats and letting Sim compute that byte at search time. We can also allow that Sim to cache stuff (boost bytes, if it uses them) to make startup faster, eventually. > For starters, you're going to at least double the footprint of the norms. For > fields with more than 127 tokens or 127 unique terms, the increase will be > greater... and if the user sets doc-boost and field-boost in a pattern that > defies RLE compression, the footprint will be greater still. On disk, yes. In memory, no (assuming your Sim impl encodes boost as byte). > I happen to think that limited search-time settability of Similarity offers a > nice feature -- the ability to futz with different weighting models and length > normalization settings without reindexing -- and that it's worth exploring in > pursuit of this feature. > > But by opting to forego the lossy compression now performed by encodeNorm() at > index-time and store precursor statistics instead, we are going to take a hit > on index size even with lossless compression. I think it's worth letting the custom Sim cache stuff [privately] on disk, ie the byte norms, eventually. > Furthermore, delaying Similarity choice means that it becomes the user's > responsibility to ensure that index-time Codec choice is compatible with > search-time Similarity choice. 
> In contrast, setting Similarity at index-time
> means that the core gets to pick the Codec and can ensure that all the
> necessary data gets encoded, sparing the user from having to understand the
> gory details of posting formats.

Yeah this is the part I struggle with -- how to make index-time field
options "intelligible". But I think good defaulting does 90% of the
work. The remaining 10% can work backwards from their search needs to
what must be done at indexing.

> In summary, I think search-time setting of Similarity is a nice feature but a
> poor requirement. I'm not persuaded that this proposal to banish Similarity
> from index-time is wise.

OK I think we just differ...

>> But I don't like baking in search concepts at index time...
>
> Then you ought to use a traditional RDBMS rather than an indexing engine, and
> make sure you don't put indexes on any of the fields in your tables. :)
>
> Or maybe an RDBMS has too many search concepts baked in, and a flat file would
> be best. :)
>
> Seriously... optimizing on-disk data structures to accommodate anticipated
> search query patterns and maximize speed and relevance... that's what
> indexing's all about, ain't it?

You're over-reading into what I said. I mean specifically one should
not have to commit to the precise scoring model they will use for a
given field, when they index that field. Many scoring models are
possible if you store enough stats in the index.

> And what class other than Similarity knows enough about the scoring algorithm
> to perform these data reduction tasks? If it's not going to be Similarity
> itself, it has to be something that knows absolutely everything about the
> Similarity implementation's scoring model.

I don't follow this... It will be Sim that computes norm bytes. I
mean, other classes can go and look @ these stats if they want, too...
users will come up with neat uses over time :)

>> > Right.
>> > However, now that I've thought about it, if a user indicates that a
>> > field is "match-only" by supplying a MatchSimilarity, we know that we can
>> > omit boost bytes.
>> >
>> > So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
>> > Huzzah!
>> >
>> > One down, one to go. :)
>>
>> Hmm except shouldn't you allow omitting boost bytes but keeping term
>> freqs? Ie all docs are roughly the same length (say, a title field)
>> and I never boost them? How will you allow this?
>
> I think that you've described an uncommon use case, and it's tempting to just
> wave it off with the easy answer: you spec a Sim that writes such a format.

I don't think this is so uncommon? (This is the omitNorms case in
Lucene today, except you still gotta index positions, until we
decouple the two = LUCENE-2048. Such a nice round binary number for
remembering...).

> But here's where maybe Lucy can steal from the Lucene flex branch.

Yay: poaching!

> We can give Similarity a makePostingCodec() factory method. Then,
> we can publish common PostingCodecs as public classes, allowing us
> to support different formats with minimal effort.
>
> class MySim extends Similarity {
>
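
Not a continuation of the truncated snippet above, but a rough Java sketch of the search-time idea argued earlier in this message: keep raw per-document stats in the index and let the Sim turn them into a norm when a searcher opens. Every name here (RawFieldStats, SearchTimeSim, SearchTimeNorms) is made up for illustration and is not Lucene or Lucy API. The lossy byte encoding is left out, and numTokens is assumed to be at least 1.

```java
/** Hypothetical raw per-document statistics kept in the index instead of a norm byte. */
final class RawFieldStats {
    final int numTokens; // field length in tokens (assumed >= 1)
    final float boost;   // combined doc-boost and field-boost

    RawFieldStats(int numTokens, float boost) {
        this.numTokens = numTokens;
        this.boost = boost;
    }
}

/** Hypothetical search-time hook: turns raw stats into a norm. */
interface SearchTimeSim {
    float norm(RawFieldStats stats);
}

final class SearchTimeNorms {
    /** Length-normalized scoring model: boost / sqrt(numTokens). */
    static final SearchTimeSim LENGTH_NORMALIZED =
        stats -> (float) (stats.boost / Math.sqrt(stats.numTokens));

    /** Alternative model that ignores field length; only the boost survives. */
    static final SearchTimeSim BOOST_ONLY = stats -> stats.boost;

    /** Computes one norm per document; a real Sim could cache the result. */
    static float[] computeNorms(RawFieldStats[] perDoc, SearchTimeSim sim) {
        float[] norms = new float[perDoc.length];
        for (int doc = 0; doc < perDoc.length; doc++) {
            norms[doc] = sim.norm(perDoc[doc]);
        }
        return norms;
    }
}
```

The point of the sketch is only that the same stored stats can feed two different scoring models without reindexing; the disk-footprint cost Marvin raises is not addressed by it.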
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845263#action_12845263 ]

Michael McCandless commented on LUCENE-2293:

bq. For example, currently a nice optimization would be to store the first posting in the PostingList object and only allocate slices once you see the second occurrence (similar to the pulsing codec)?

I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed). EG we could have a separate class dedicated to the singleton case. When a term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting.

bq. What exactly do you mean with parallel arrays? Parallel to the termHash array? Then the termsHash array would not be an array of PostingList objects anymore, but an array of pointers into the char[] array? And you'd have e.g. a parallel int[] array for df, another int[] for pointers into the postings byte pool, etc? Something like that?

I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...). And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int[] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access.
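
A minimal sketch of the parallel-arrays idea from the comment above, assuming only two of the PostingList members (docFreq and lastDocID) and a simple doubling growth policy. The class ParallelPostingArrays and its methods are invented for illustration; this is not the actual FreqProxTermsWriter code, and the paged variant is not shown.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: per-term state lives in parallel int[] arrays indexed
 * by a dense posting ID, instead of one object instance per unique term.
 */
final class ParallelPostingArrays {
    private int[] docFreqs = new int[4];
    private int[] lastDocIDs = new int[4];
    private int size = 0; // number of posting IDs handed out so far

    /** Enrolls a new term and returns its dense posting ID (0, 1, 2, ...). */
    int newPosting(int firstDocID) {
        if (size == docFreqs.length) {
            // Grow all parallel arrays together so they stay aligned.
            int newLength = size * 2;
            docFreqs = Arrays.copyOf(docFreqs, newLength);
            lastDocIDs = Arrays.copyOf(lastDocIDs, newLength);
        }
        docFreqs[size] = 1;
        lastDocIDs[size] = firstDocID;
        return size++;
    }

    /** Records another occurrence; docFreq only grows when the document changes. */
    void addOccurrence(int postingID, int docID) {
        if (lastDocIDs[postingID] != docID) {
            docFreqs[postingID]++;
            lastDocIDs[postingID] = docID;
        }
    }

    int docFreq(int postingID) {
        return docFreqs[postingID];
    }

    int lastDocID(int postingID) {
        return lastDocIDs[postingID];
    }
}
```

A lookup is a plain array read such as lastDocIDs[postingID], which is the access pattern the comment describes. The doubling step is where the "oversize allocation overhead" mentioned above would show up.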
> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5. This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads. So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput. So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter. Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort. The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread). We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states. Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running. This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs. But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc. However, a single thread would still get
> monotonic assignment of docIDs.
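
The proposed change in the description above (private RAM buffer, private docID stream, flush each thread state as its own segment) could look roughly like the following. This is a toy sketch under loud assumptions: PrivateRamSegment is a made-up class, documents are plain strings, "flushing" just moves the buffer into an in-memory list standing in for a segment on disk, and each instance is meant to be owned by exactly one indexing thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one thread state's private RAM buffer. */
final class PrivateRamSegment {
    private final List<String> bufferedDocs = new ArrayList<>();
    private final int maxBufferedDocs;
    private int nextDocID = 0; // private docID stream, restarts at 0 per segment

    /** Stand-in for segments already written to the directory. */
    final List<List<String>> flushed = new ArrayList<>();

    PrivateRamSegment(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers a doc and returns its segment-private docID; flushes only this buffer when full. */
    int addDocument(String doc) {
        int docID = nextDocID++;
        bufferedDocs.add(doc);
        if (bufferedDocs.size() >= maxBufferedDocs) {
            flush();
        }
        return docID;
    }

    /** Writes this buffer out as its own segment; other thread states keep indexing. */
    void flush() {
        if (bufferedDocs.isEmpty()) {
            return;
        }
        flushed.add(new ArrayList<>(bufferedDocs));
        bufferedDocs.clear();
        nextDocID = 0;
    }
}
```

With two such segments, filling one triggers its flush while the other keeps buffering, which is the "flush one when it gets too big" behaviour described above. Merging the flushed segments is left to the normal merge machinery and is not modelled here.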
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845261#action_12845261 ]

Michael McCandless commented on LUCENE-2324:

Sounds great -- let's test it in practice.

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845257#action_12845257 ]

Michael McCandless commented on LUCENE-2312:

{quote}
I got the basics of the term enum working, it can be completed fairly easily. So I moved on to term docs... There we got some work to do? Because we're not storing the skip lists in the ram buffer, currently. I guess we'll need a new FreqProxTermsWriterPerField that stores the skip lists as they're being written? How will that work? Doesn't the multi-level skip list assume a set number of docs?
{quote}

Sounds like you & Michael should sync up!

Good question on skipping -- for first cut we can have no skipping (and just scan)? Skipping may not be that important in practice, unless RAM buffer becomes truly immense. Of course, the tinier the docs the more important skipping will be...
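
For what "no skipping (and just scan)" in the comment above might mean in practice, here is a small hedged sketch: an iterator over one term's doc IDs whose advance() simply walks forward instead of consulting a skip list. ScanningDocIterator is a hypothetical class operating on a plain sorted int[], not the real RAM buffer format, and its NO_MORE_DOCS sentinel is defined locally for the example.

```java
/** Hypothetical in-RAM doc ID iterator that advances by linear scan, with no skip list. */
final class ScanningDocIterator {
    /** Sentinel returned once the postings are exhausted (local to this sketch). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIDs; // ascending doc IDs for one term
    private int pos = -1;

    ScanningDocIterator(int[] sortedDocIDs) {
        this.docIDs = sortedDocIDs;
    }

    /** Moves to the next doc ID, or returns NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        if (pos < docIDs.length) {
            pos++;
        }
        return pos < docIDs.length ? docIDs[pos] : NO_MORE_DOCS;
    }

    /** Moves to the first doc ID >= target by scanning; cost is linear in the docs passed over. */
    int advance(int target) {
        int doc;
        do {
            doc = nextDoc();
        } while (doc < target);
        return doc;
    }
}
```

The scan costs time proportional to the number of postings stepped over, which is why it only stays cheap while the buffered posting lists are short.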
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845255#action_12845255 ]

Michael McCandless commented on LUCENE-2312:

Yes, commit should flush & sync all doc writers, and rollback must abort all of them.

bq. I also have a separate indexing chain prototype working with searchable RAM buffer (single-threaded)

Yay!

bq. but slightly different postinglist format (some docs nowadays only have 140 characters ).

New sponsor, eh? ;)

But, yes, I suspect an indexer chain optimized to tiny docs can get sizable gains.

What change to the postings format? Is the change only in the RAM buffer or also in the index? If it's in the index... we should probably do this under flex.

bq. It seems really fast. I spent a long time thinking about lock-free algorithms and data structures, so indexing performance should be completely independent of the search load (in theory). I need to think a bit more about how to make it work with "normal" documents and Lucene's current in-memory format.

Sounds like awesome progress!! Want some details over here :)
[jira] Resolved: (LUCENE-2322) Remove verbosity from tests and make configureable
[ https://issues.apache.org/jira/browse/LUCENE-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-2322.
---

Resolution: Fixed

Committed revision: 923112

> Remove verbosity from tests and make configureable
> --------------------------------------------------
>
> Key: LUCENE-2322
> URL: https://issues.apache.org/jira/browse/LUCENE-2322
> Project: Lucene - Java
> Issue Type: Sub-task
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2322-surround.patch, LUCENE-2322-surround.patch,
> LUCENE-2322.patch
>
>
> The parent issue added the functionality to LuceneTestCase(J4), this patch
> applies it to most tests.